Abstract 

Temporal difference (TD) methods constitute a class of methods for learning predictions 
in multi-step prediction problems, parameterized by a recency factor λ. Currently the most 
important application of these methods is to temporal credit assignment in reinforcement 
learning. Well-known reinforcement learning algorithms, such as AHC or Q-learning, may 
be viewed as instances of TD learning. This paper examines the issues of the efficient 
and general implementation of TD(λ) for arbitrary λ, for use with reinforcement learning 
algorithms optimizing the discounted sum of rewards. The traditional approach, based on 
eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD 
(Truncated Temporal Differences) procedure is proposed as an alternative that admittedly 
only approximates TD(λ), but requires very little computation per action and can be used 
with arbitrary function representation methods. The idea from which it is derived is fairly 
simple and not new, but probably unexplored so far. Encouraging experimental results are 
presented, suggesting that using λ > 0 with the TTD procedure allows one to obtain a 
significant learning speedup at essentially the same cost as usual TD(0) learning. 



1. Introduction 

Reinforcement learning (RL, e.g., Sutton, 1984; Watkins, 1989; Barto, 1992; Sutton, Barto, 
& Williams, 1991; Lin, 1992, 1993; Cichosz, 1994) is a machine learning paradigm that relies 
on evaluative training information. At each step of discrete time a learning agent observes 
the current state of its environment and executes an action. Then it receives a reinforce- 
ment value, also called a payoff or a reward (punishment), and a state transition takes 
place. Reinforcement values provide a relative measure of the quality of actions executed 
by the agent. Both state transitions and rewards may be stochastic, and the agent does not 
know either transition probabilities or expected reinforcement values for any state-action 
combinations. The objective of learning is to identify a decision policy (i.e., a state-action 
mapping) that maximizes the reinforcement values received by the agent in the long term. 
A commonly assumed formal model of a reinforcement learning task is a Markovian decision 
problem (MDP, e.g., Ross, 1983). The Markov property means that state transitions and 
reinforcement values always depend solely on the current state and the current action: there 
is no dependence on previous states, actions, or rewards, i.e., the state information supplied 
to the agent is sufficient for making optimal decisions. 

All the information the agent has about the external world and its task is contained 
in a series of environment states and reinforcement values. It is never told what actions 
to execute in particular states, or what actions (if any) would be better than those which 

©1995 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved. 



CICHOSZ 



it actually performs. It must learn an optimal policy by observing the consequences of its 
actions. The abstract formulation and generality of the reinforcement learning paradigm 
make it widely applicable, especially in such domains as game-playing (Tesauro, 1992), 
automatic control (Sutton et al., 1991), and robotics (Lin, 1993). To formulate a particular 
task as a reinforcement learning task, one just has to design appropriate state and action 
representations, and a reinforcement mechanism specifying the goal of the task. The main 
limitation of RL is that it is by nature a trial-and-error learning method, and is therefore 
hardly applicable in domains where errors are costly. 

A commonly studied performance measure to be maximized by an RL agent is the 
expected total discounted sum of reinforcement: 

$$E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \tag{1}$$

where $r_t$ denotes the reinforcement value received at step t, and 0 ≤ γ ≤ 1 is a discount 
factor, which adjusts the relative significance of long-term rewards versus short-term ones. 
To maximize the sum for any positive γ, the agent must take into account the delayed 
consequences of its actions: reinforcement values may be received several steps after the 
actions that contributed to them were performed. This is referred to as learning with delayed 
reinforcement (Sutton, 1984; Watkins, 1989). Other reinforcement learning performance 
measures have also been considered (Heger, 1994; Schwartz, 1993; Singh, 1994), but in this 
work we limit ourselves exclusively to the performance measure specified by Equation 1. 
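
As a small illustration of the performance measure in Equation 1, the sketch below computes the discounted sum for one finite, already observed reward sequence. It is only a sample of the quantity inside the expectation, not the expectation itself; the function name `discounted_return` and the example rewards are hypothetical and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence: sum over t of gamma**t * r_t."""
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical example: a single reward of 1 received two steps in the future.
sample_rewards = [0.0, 0.0, 1.0]
print(discounted_return(sample_rewards, gamma=0.5))
```

With γ = 0 only the immediate reward counts, while values of γ close to 1 make delayed rewards almost as significant as immediate ones.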

The key problem that must be solved in order to learn an optimal policy under the 
conditions of delayed reinforcement is known as the temporal credit assignment problem 
(Sutton, 1984). It is the problem of assigning credit or blame for the overall outcomes 
of a learning system (i.e., long-term reinforcement values) to each of its individual actions, 
possibly taken several steps before the outcomes could be observed. Discussing reinforcement 
learning algorithms, we will concentrate on temporal credit assignment and ignore the issues 
of structural credit assignment (Sutton, 1984), the other aspect of credit assignment in RL 
systems. 

1.1 Temporal Difference Methods 

The temporal credit assignment problem in reinforcement learning is typically solved using 
algorithms based on the methods of temporal differences (TD). They were introduced 
by Sutton (1988) as a class of methods for learning predictions in multi-step prediction 
problems. In such problems the correctness of a prediction is not revealed at once, but only 
more than one step after the prediction was made, though some partial information relevant to 
its correctness is revealed at each step. This information is available and observed as the 
current state of a prediction problem, and the corresponding prediction is computed as a 
value of a function of states. 

Consider a multi-step prediction problem where at each step it is necessary to learn a 
prediction of some final outcome. It could be for example predicting the outcome of a game 
of chess in subsequent board situations, predicting the weather on Sunday on each day of 
the week, or forecasting some economic indicators. The traditional approach to learning 
such predictions would be to wait until the outcome occurs, keeping track of all predictions 
computed at intermediate steps, and then, for each of them, to use the difference between 
the actual outcome and the predicted value as the training error. This is supervised learning, 
where directed training information is obtained by comparing the outcome with predictions 
produced at each step. Each of the predictions is modified so as to make it closer to the 
outcome. 

Temporal difference learning makes it unnecessary to always wait for the outcome. At 
each step the difference between two successive predictions is used as the training error. 
Each prediction is modified so as to make it closer to the next one. In fact, TD is a class 
of methods referred to as TD(λ), where 0 ≤ λ ≤ 1 is called a recency factor. Using λ > 0 
allows one to incorporate prediction differences from more time steps, in the hope of speeding 
up learning. 

Temporal credit assignment in reinforcement learning may be viewed as a prediction 
problem. The outcome to predict in each state is simply the total discounted reinforcement 
that will be received starting from that state and following the current policy. Such predic- 
tions can be used for modifying the policy so as to optimize the performance measure given 
by Equation 1. Example reinforcement learning algorithms that implement this idea, called 
TD-based algorithms, will be presented in Section 2.2. 

1.2 Paper Overview 

Much of the research concerning TD-based reinforcement learning algorithms has concentrated 
on the simplest TD(0) case. However, experimental results obtained with TD(λ > 0) 
indicate that it often allows one to obtain a significant learning speedup (Sutton, 1988; 
Lin, 1993; Tesauro, 1992). It has also been suggested (e.g., Peng & Williams, 1994) that 
TD(λ > 0) should perform better in non-Markovian environments than TD(0) (i.e., it should 
be less sensitive to the potential violations of the Markov property). It is thus important 
to develop efficient and general implementation techniques that would allow TD-based RL 
algorithms to use arbitrary λ. This has been the motivation of this work. 

The remainder of this paper is organized as follows. In Section 2 a formal definition of 
TD methods is presented and their application to reinforcement learning is discussed. Three 
example RL algorithms are briefly described: AHC (Sutton, 1984), Q-learning (Watkins, 
1989; Watkins & Dayan, 1992), and advantage updating (Baird, 1993). Section 3 presents 
the traditional approach to TD(λ) implementation, based on so-called eligibility traces, 
which is criticized for inefficiency and lack of generality. In Section 4 the analysis of the 
effects of the TD algorithm leads to the formulation of the TTD (Truncated Temporal 
Differences) procedure. The two remaining sections are devoted to experimental results 
and concluding discussion. 

2. Definition of TD(λ) 

When Sutton (1988) introduced TD methods, he assumed they would use parameter estimation 
techniques for prediction representation. According to his original formulation, 
states of a prediction problem are represented by vectors of real-valued features, and corresponding 
predictions are computed by the use of a set of modifiable parameters (weights). 
Under such representation learning consists in adjusting the weights appropriately on the 
basis of observed state sequences and outcomes. Below we present an alternative formulation, 
adopted from Dayan (1992), that simplifies the analysis of the effects of the TD(λ) 
algorithm. In this formulation states may be elements of an arbitrary finite state space, and 
predictions are values of some function of states. Transforming Sutton's original definition 
of TD(λ) to this alternative form is straightforward. 

When discussing either the generic or RL-oriented form of TD methods, we consistently 
ignore the issues of function representation. It is only assumed that TD predictions 
or functions maintained by reinforcement learning algorithms are represented by a 
method that allows adjusting function values using some error values, controlled by a learning 
rate parameter. Whenever we write that the value of an n-argument function φ for 
arguments $p_0, p_1, \ldots, p_{n-1}$ should be updated using an error value of Δ, we mean that 
$\varphi(p_0, p_1, \ldots, p_{n-1})$ should be moved towards $\varphi(p_0, p_1, \ldots, p_{n-1}) + \Delta$, to a degree controlled 
by some learning rate factor η. The general form of this abstract update operation is written 
as 

$$\text{update}^{\eta}(\varphi, p_0, p_1, \ldots, p_{n-1}, \Delta). \tag{2}$$

Under this convention, a learning algorithm is defined by the rule it uses for computing 
error values. 
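
To make the abstract update operation concrete, the following is a minimal sketch for one particular function representation, a lookup table. The choice of a table, the name `update`, and the argument layout are assumptions made only for illustration; the text deliberately leaves the representation open, and other approximators would implement the same operation differently.

```python
from collections import defaultdict


def update(phi, args, delta, eta):
    """Move the tabular value phi[args] towards phi[args] + delta.

    phi   : table mapping argument tuples to values (an assumed representation)
    args  : tuple of function arguments (p_0, ..., p_{n-1})
    delta : error value
    eta   : learning rate factor controlling the size of the move
    """
    phi[args] += eta * delta


# Hypothetical usage: a one-argument utility table, initially all zeros.
U = defaultdict(float)
update(U, ("s0",), delta=2.0, eta=0.5)
```

With η = 1 the table entry jumps all the way to the target value, while smaller η moves it only part of the way.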



2.1 Basic Formulation 

Let $x_0, x_1, \ldots, x_{m-1}$ be a sequence of m states of a multi-step prediction problem. Each 
state $x_t$ can be observed at time step t, and at step m, after passing the whole sequence, a 
real-valued outcome z can be observed. The learning system is required to produce a corresponding 
sequence of predictions $P(x_0), P(x_1), \ldots, P(x_{m-1})$, each of which is an estimate 
of z. 

Following Dayan (1992), let us define for each state x: 

$$\chi_x(t) = \begin{cases} 1 & \text{if } x_t = x \\ 0 & \text{otherwise.} \end{cases}$$



Then the TD(λ) prediction error for each state x determined at step t is given by: 

$$\Delta_x(t) = (P(x_{t+1}) - P(x_t)) \sum_{k=0}^{t} \lambda^{t-k} \chi_x(k), \tag{3}$$

where 0 ≤ λ ≤ 1 and $P(x_m) = z$ by definition, and the total prediction error for state x 
determined after the whole observed sequence accordingly is: 

$$\Delta_x = \sum_{t=0}^{m-1} \Delta_x(t) = \sum_{t=0}^{m-1} \left\{ (P(x_{t+1}) - P(x_t)) \sum_{k=0}^{t} \lambda^{t-k} \chi_x(k) \right\}. \tag{4}$$

Thus, learning at each step is driven by the difference between two temporally successive 
predictions. When λ > 0, the prediction difference at time t affects not only $P(x_t)$, but also 
predictions from previous time steps, to an exponentially decaying degree.¹ 

1. Alternatively, learning the prediction at step t relies not only on the prediction difference from that 
step, but also on future prediction differences. This equivalent formulation will play a significant role in 
Section 4. 
There are two ways of using the errors defined this way for learning. The first is to compute 
total errors $\Delta_x$ for all states x, by accumulating the $\Delta_x(t)$ errors computed at each time 
step t, and to use them after passing the whole state sequence to update predictions P(x). 
This corresponds to batch learning mode. The second possibility, called incremental or on-line 
learning, often more attractive in practice, is to update predictions at each step t using 
current error values $\Delta_x(t)$. It is then necessary to modify Equation 3 appropriately, so as 
to take into account that predictions are changed at each step: 

$$\Delta_x(t) = (P_t(x_{t+1}) - P_t(x_t)) \sum_{k=0}^{t} \lambda^{t-k} \chi_x(k), \tag{5}$$

where $P_t(x)$ designates the prediction for state x available at step t. 
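
The batch mode described above can be sketched directly from Equations 3 and 4. The code below is an illustrative, deliberately naive transcription for a tabular prediction function held fixed during the sequence; the function name `batch_td_errors` and the dictionary-based representation are assumptions, not part of the original formulation.

```python
def batch_td_errors(states, z, P, lam):
    """Total TD(lambda) errors Delta_x of Equation 4 for one observed sequence.

    states : list of visited states x_0, ..., x_{m-1}
    z      : outcome observed after the whole sequence
    P      : dict mapping each state to its current (fixed) prediction
    lam    : recency factor lambda in [0, 1]
    """
    m = len(states)
    totals = {x: 0.0 for x in states}
    for t in range(m):
        # P(x_m) = z by definition, so the last difference uses the outcome.
        next_prediction = z if t == m - 1 else P[states[t + 1]]
        difference = next_prediction - P[states[t]]
        # The inner sum of Equation 3: state x_k receives weight lam**(t - k).
        for k in range(t + 1):
            totals[states[k]] += difference * lam ** (t - k)
    return totals
```

For a sequence of distinct states and λ = 1 each total error reduces to z − P(x), the supervised-learning error, while for λ = 0 each state only receives its own one-step prediction difference.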

Sutton (1988) proved the convergence of batch TD(0) for a linear representation, with 
states represented as linearly independent vectors, under the assumption that state sequences 
are generated by an absorbing Markov process.² Dayan (1992) extended his proof 
to arbitrary λ.³ 

2.2 TD(λ) for Reinforcement Learning 

So far, this paper has presented TD as a general class of prediction methods for multi-step 
prediction problems. The most important application of these methods, however, is to rein- 
forcement learning. As a matter of fact, TD methods were formulated by Sutton (1988) as 
a generalization of techniques he had previously used only in the context of temporal credit 
assignment in reinforcement learning (Sutton, 1984). 

As already stated above, the most straightforward way to formulate temporal credit 
assignment as a prediction problem is to predict at each time step t the discounted sum of 
future reinforcement 

$$z_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$

called the TD return for time t. The corresponding prediction is designated by $U(x_t)$ and 
called the predicted utility of state $x_t$. TD returns obviously depend on the policy being 
followed; we therefore assume that U values represent predicted state utilities with respect 
to the current policy. For perfectly accurate predictions we would have: 

$$U(x_t) = z_t = r_t + \gamma z_{t+1} = r_t + \gamma U(x_{t+1}).$$

Thus, for inaccurate predictions, the mismatch or TD error is $r_t + \gamma U(x_{t+1}) - U(x_t)$. The 
resulting RL-oriented TD(λ) equations take the form: 

$$\Delta_x(t) = (r_t + \gamma U_t(x_{t+1}) - U_t(x_t)) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_x(k) \tag{6}$$

2. An absorbing Markov process is defined by a set of terminal states $X_T$, a set of non-terminal states $X_N$, 
and the set of transition probabilities $P_{xy}$ for all $x \in X_N$ and $y \in X_N \cup X_T$. The absorbing property 
means that any cycles among non-terminal states cannot last indefinitely long, i.e., for any starting 
non-terminal state a terminal state will eventually be reached (all sequences eventually terminate). 

3. Recently stronger theoretical results were proved by Dayan and Sejnowski (1994) and Jaakkola, Jordan, 
and Singh (1993). 
and 

$$\Delta_x = \sum_{t=0}^{\infty} \Delta_x(t) = \sum_{t=0}^{\infty} \left\{ (r_t + \gamma U_t(x_{t+1}) - U_t(x_t)) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_x(k) \right\}. \tag{7}$$

Note the following additional differences between these equations and Equations 3 and 4: 

• time step subscripts are used with U values to emphasize on-line learning mode, 

• the discount applied in the sum in Equation 6 includes γ as well as λ for reasons that 
may be unclear now, but will be made clear in Section 4.1, 

• the summation in Equation 7 extends to infinity, because the predicted final outcome 
is not, in general, available after any finite number of steps. 

TD-based reinforcement learning algorithms may be viewed as more or less direct im- 
plementations of the general rule described by Equation 6. To see this, we will consider 
three algorithms: the well-known AHC (Sutton, 1984) and Q-learning (Watkins, 1989; Watkins 
& Dayan, 1992), and a recent development of Baird (1993) called advantage updating. All 
the algorithms rely on learning certain real-valued functions defined over the state or state 
and action space of a task. The * superscript used with any of the described functions 
designates its optimal values (i.e., corresponding to an optimal policy). Simplified versions 
of the algorithms, corresponding to TD(0), will be presented and related to Equation 6. 
The presentation below is limited solely to function update rules; for a more detailed 
description of the algorithms the reader should consult the original publications of their 
developers or, for AHC and Q-learning, Lin (1993) or Cichosz (1994). They are all closely 
related to dynamic programming methods (Barto, Sutton, & Watkins, 1990; Watkins, 1989; 
Baird, 1993), but these relations, though theoretically and practically important and fruit- 
ful, are not essential for the subject of this paper and will not be discussed. 

2.2.1 The AHC Algorithm 

The variation of the AHC algorithm described here is adopted from Sutton (1990). Two 
functions are maintained: an evaluation function V and a policy function f. The evaluation 
function evaluates each environment state and is essentially the same as what was called 
above the U function, i.e., V(x) is intended to be an estimate of the discounted sum of 
future reinforcement values received starting from state x and following the current policy. 
The policy function assigns to each state-action pair (x, a) a real number representing 
the relative merit of performing action a in state x, called the action merit. The actual 
policy is determined from action merits using some, usually stochastic, action selection 
mechanism, e.g., according to a Boltzmann distribution (as described in Section 5). The 
optimal evaluation of state x, V*(x), is the expected total discounted reinforcement that 
will be received starting from state x and following an optimal policy. 

Both functions are updated at each step t, after executing action $a_t$ in state $x_t$, 
according to the following rules: 

$$\text{update}^{\alpha}(V, x_t, r_t + \gamma V_t(x_{t+1}) - V_t(x_t));$$

$$\text{update}^{\beta}(f, x_t, a_t, r_t + \gamma V_t(x_{t+1}) - V_t(x_t)).$$
The update rule for the V-function directly corresponds to Equation 6 for λ = 0. The update 
rule for the policy function increases or decreases the action merit of an action depending 
on whether its long-term consequences appear to be better or worse than expected. We 
present this simplified form of AHC, corresponding to TD(0), because this paper proposes 
an alternative way of using TD(λ > 0) to that implemented by the original AHC algorithm 
presented by Sutton (1984). 
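
A single step of the simplified AHC update can be sketched as follows for tabular V and f. This is only an illustration under assumptions: the tabular representation, the function name `ahc_step`, and the use of separate learning rates `alpha` and `beta` for the two functions are choices made here, and action selection from the merits is left out.

```python
from collections import defaultdict


def ahc_step(V, f, x, a, r, x_next, gamma, alpha, beta):
    """One simplified (TD(0)) AHC update after executing action a in state x.

    V : table of state evaluations, f : table of action merits.
    The same TD(0) error drives both updates.
    """
    td_error = r + gamma * V[x_next] - V[x]
    V[x] += alpha * td_error        # evaluation function update
    f[(x, a)] += beta * td_error    # policy (action merit) update
    return td_error


# Hypothetical usage with zero-initialized tables.
V = defaultdict(float)
f = defaultdict(float)
ahc_step(V, f, x="s0", a="left", r=1.0, x_next="s1",
         gamma=0.9, alpha=0.5, beta=0.1)
```

A positive error raises both the evaluation of the state and the merit of the action just taken; a negative error lowers both.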

2.2.2 The Q-Learning Algorithm 

Q-learning learns a single function of states and actions, called a Q-function. To each 
state-action pair (x, a) it assigns a Q-value or action utility Q(x, a), which is an estimate of 
the discounted sum of future reinforcement values received starting from state x by executing 
action a and then following a greedy policy with respect to the current Q-function (i.e., 
performing in each state actions with maximum Q-values). The current policy is implicitly 
defined by Q-values. When the optimal Q-function is learned, then a greedy policy with 
respect to action utilities is an optimal policy. 
The update rule for the Q-function is: 

$$\text{update}^{\alpha}(Q, x_t, a_t, r_t + \gamma \max_a Q_t(x_{t+1}, a) - Q_t(x_t, a_t)).$$

To show its correspondence to the TD(0) version of Equation 6, we simply assume that 
predicted state utilities are represented by Q-values so that $Q_t(x_t, a_t)$ corresponds to $U_t(x_t)$ 
and $\max_a Q_t(x_{t+1}, a)$ corresponds to $U_t(x_{t+1})$. 
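
The Q-learning rule above can likewise be sketched for a tabular Q-function. The table representation, the function name `q_learning_step`, and the explicit `actions` list are assumptions introduced for illustration only.

```python
from collections import defaultdict


def q_learning_step(Q, x, a, r, x_next, actions, gamma, alpha):
    """One TD(0) Q-learning update after executing action a in state x.

    Q       : table mapping (state, action) pairs to action utilities
    actions : actions available in x_next (assumed finite and non-empty)
    """
    # The maximum Q-value in the successor state plays the role of U(x_{t+1}).
    best_next = max(Q[(x_next, b)] for b in actions)
    td_error = r + gamma * best_next - Q[(x, a)]
    Q[(x, a)] += alpha * td_error
    return td_error


# Hypothetical usage with a zero-initialized table and two actions.
Q = defaultdict(float)
q_learning_step(Q, x="s0", a="left", r=1.0, x_next="s1",
                actions=["left", "right"], gamma=0.9, alpha=0.5)
```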

2.2.3 The Advantage Updating Algorithm 

In advantage updating two functions are maintained: an evaluation function V and an 
advantage function A. The evaluation function has essentially the same interpretation as its 
counterpart in AHC, though it is learned in a different way. The advantage function assigns 
to each state-action pair (x,a) a real number A(x,a) representing the degree to which the 
expected discounted sum of future reinforcement is increased by performing action a in 
state x, relative to the action currently considered best in that state. The optimal action 
advantages are negative for all suboptimal actions and equal to 0 for optimal actions, and can 
be related to the optimal Q-values by: 

$$A^*(x, a) = Q^*(x, a) - \max_{a'} Q^*(x, a').$$

Like action utilities, action advantages implicitly define a policy. 

The evaluation and advantage functions are updated at step t by applying the following 
rules: 

$$\text{update}^{\alpha}(A, x_t, a_t, \max_a A_t(x_t, a) - A_t(x_t, a_t) + r_t + \gamma V_t(x_{t+1}) - V_t(x_t));$$

$$\text{update}^{\beta}(V, x_t, \tfrac{1}{\alpha}[\max_a A_{t+1}(x_t, a) - \max_a A_t(x_t, a)]).$$

The update rule for the advantage function is somewhat more complex than the AHC or 
Q-learning rules, but it still contains a term that directly corresponds to the TD(0) form of 
Equation 6, with V in place of U. 

Actually, what has been presented above is a simplified version of advantage updating. 
The original algorithm differs in two details: 
• the time step duration Δt is explicitly included in the update rules, while in this 
presentation we assumed Δt = 1, 

• besides learning updates, described above, so-called normalizing updates are performed. 

3. Eligibility Traces 

It is obvious that the direct implementation of the computation described by Equation 6 is 
not too tempting. It requires maintaining $\chi_x(t)$ values for each state x and past time step t. 
Note, however, that one only needs to maintain the whole sums $\sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_x(k)$ for all x 
and only one (current) t, which is much easier due to a simple trick. Substituting 

$$e_x(t) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_x(k),$$

we can define the following recursive update rule: 

$$e_x(0) = \begin{cases} 1 & \text{if } x_0 = x \\ 0 & \text{otherwise,} \end{cases} \qquad e_x(t) = \begin{cases} \gamma\lambda\, e_x(t-1) + 1 & \text{if } x_t = x \\ \gamma\lambda\, e_x(t-1) & \text{otherwise.} \end{cases}$$



The quantities $e_x(t)$ defined this way are called activity or eligibility traces (Barto, 
Sutton, & Anderson, 1983; Sutton, 1984; Watkins, 1989). Whenever a state is visited, its 
activity becomes high and then gradually decays until it is visited again. The update to 
the predicted utility of each state x resulting from visiting state $x_t$ at time t may be then 
written as 

$$\Delta_x(t) = (r_t + \gamma U_t(x_{t+1}) - U_t(x_t))\, e_x(t), \tag{9}$$

which is a direct transformation of Equation 6. 
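
The eligibility traces technique can be sketched for a tabular utility function as below. The sketch assumes dictionary tables, the hypothetical name `trace_td_step`, and a single learning rate `eta`; it stores traces only for states visited so far, which is an implementation shortcut rather than something prescribed by the text.

```python
from collections import defaultdict


def trace_td_step(U, e, x, r, x_next, gamma, lam, eta):
    """One on-line TD(lambda) step with eligibility traces (Equation 9).

    U : table of predicted utilities, e : table of eligibility traces.
    """
    # Recursive trace rule: decay every trace, then add 1 for the visited state.
    for s in list(e):
        e[s] *= gamma * lam
    e[x] += 1.0
    td_error = r + gamma * U[x_next] - U[x]
    # Every state with a stored trace is updated in proportion to its trace.
    for s, trace in e.items():
        U[s] += eta * td_error * trace
    return td_error


# Hypothetical usage: two steps of a short trajectory.
U = defaultdict(float)
e = defaultdict(float)
trace_td_step(U, e, "a", 0.0, "b", gamma=0.9, lam=0.5, eta=0.5)
trace_td_step(U, e, "b", 1.0, "end", gamma=0.9, lam=0.5, eta=0.5)
```

Note that each step loops over all states with stored traces, which is the per-step cost discussed in the remainder of this section.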

This technique (with minor differences) was already used in the early works of Barto 
et al. (1983) and Sutton (1984), before the actual formulation of TD(λ). It is especially 
suitable for use with parameter estimation function representation methods, such as connectionist 
networks. Instead of having one $e_x$ value for each state x one then has one $e_i$ 
value for each weight $w_i$. That is how eligibility traces were actually used by Barto et al. 
(1983) and Sutton (1984), inspired by an earlier work of Klopf (1982). Note that in the case 
of the AHC algorithm, different λ values may be used for maintaining traces used by the 
evaluation and policy functions. 

Unfortunately, the technique of eligibility traces is not general enough to be easy to im- 
plement with an arbitrary function representation method. It is not clear, for example, how 
it could be used with such an important class of function approximators as memory-based 
(or instance-based) function approximators (Moore & Atkeson, 1992). Applied with a pure 
tabular representation, it has significant drawbacks. First, it requires additional memory 
locations, one per state. Second, and even more painfully, it requires modifying both U(x) 
and $e_x$ for all x at each time step. This operation dominates the computational complexity 






Truncating Temporal Differences 



of TD-based reinforcement learning algorithms, and makes using TD(λ > 0) much more expensive 
than TD(0). The eligibility traces implementation of TD(λ) is thus, for large state 
spaces, absolutely impractical on serial computers, unless an appropriate function approximator 
is used that allows updating function values and eligibility traces for many states 
concurrently (such as a multi-layer perceptron). But even when such an approximator is 
used, there are still significant additional computational costs (in both memory and time) of 
using TD(λ) for λ > 0 versus TD(0). Another drawback of this approach will be revealed 
in Section 4.1. 

4. Truncating Temporal Differences 

This section starts from an alternative formulation of TD(λ) for reinforcement learning. 
We then relate the TD(λ) training errors used in this alternative formulation 
to TD(λ) returns. Finally, we propose approximating TD(λ) returns with truncated TD(λ) 
returns, and we show how they can be computed and used for on-line reinforcement learning. 

4.1 TD Errors and TD Returns 

Let us take a closer look at Equation 7. Consider the effects of experiencing a sequence of 
states $x_0, x_1, \ldots, x_k, \ldots$ and corresponding reinforcement values $r_0, r_1, \ldots, r_k, \ldots$. For the 
sake of simplicity, assume for a while that all states in the sequence are different (though it 
is of course impossible for finite state spaces). Applying Equation 7 to state $x_t$ under this 
assumption we have: 

$$\begin{aligned} \Delta_{x_t} &= r_t + \gamma U_t(x_{t+1}) - U_t(x_t) \\ &\quad + \gamma\lambda\,[r_{t+1} + \gamma U_{t+1}(x_{t+2}) - U_{t+1}(x_{t+1})] \\ &\quad + (\gamma\lambda)^2\,[r_{t+2} + \gamma U_{t+2}(x_{t+3}) - U_{t+2}(x_{t+2})] + \ldots \\ &= \sum_{k=0}^{\infty} (\gamma\lambda)^k\,[r_{t+k} + \gamma U_{t+k}(x_{t+k+1}) - U_{t+k}(x_{t+k})]. \end{aligned}$$

If a state occurs several times in the sequence, each visit to that state yields a similar update. 
This simple observation opens a way to an alternative (though equivalent) formulation of 
TD(λ), offering novel implementation possibilities. 
Let 

$$\Delta^0_t = r_t + \gamma U_t(x_{t+1}) - U_t(x_t) \tag{10}$$

be the TD(0) error at time step t. We define the TD(λ) error at time t using TD(0) errors 
as follows: 

$$\Delta^\lambda_t = \sum_{k=0}^{\infty} (\gamma\lambda)^k [r_{t+k} + \gamma U_{t+k}(x_{t+k+1}) - U_{t+k}(x_{t+k})] = \sum_{k=0}^{\infty} (\gamma\lambda)^k \Delta^0_{t+k}. \tag{11}$$

Now, we can express the overall TD(λ) error for state x, $\Delta_x$, in terms of $\Delta^\lambda_t$ errors: 

$$\Delta_x = \sum_{t=0}^{\infty} \Delta^\lambda_t\, \chi_x(t). \tag{12}$$
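In fact, from Equation 7 we have: 

$$\Delta_x = \sum_{t=0}^{\infty} \Delta^0_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_x(k) = \sum_{t=0}^{\infty} \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \Delta^0_t\, \chi_x(k). \tag{13}$$

Swapping the order of the two summations we get: 

$$\Delta_x = \sum_{k=0}^{\infty} \sum_{t=k}^{\infty} (\gamma\lambda)^{t-k} \Delta^0_t\, \chi_x(k). \tag{14}$$

Finally, by exchanging k and t with each other, we obtain: 

$$\Delta_x = \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t} \Delta^0_k\, \chi_x(t) = \sum_{t=0}^{\infty} \sum_{k=0}^{\infty} (\gamma\lambda)^{k} \Delta^0_{t+k}\, \chi_x(t) = \sum_{t=0}^{\infty} \Delta^\lambda_t\, \chi_x(t). \tag{15}$$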

Note the following important difference between $\Delta_x(t)$ (Equation 6) and $\Delta^\lambda_t$: the former 
is computed at each time step t for all x and the latter is computed at each step t only 
for $x_t$. Accordingly, at step t the error value $\Delta_x(t)$ is used for adjusting U(x) for all x 
and $\Delta^\lambda_t$ is only used for adjusting $U(x_t)$. This is crucial for the learning procedure proposed 
in Section 4.2. While applying such defined $\Delta^\lambda_t$ errors on-line makes changes to predicted 
state utilities at individual steps clearly different from those described by Equation 6, the 
overall effects of experiencing the whole state sequence (i.e., the sums of all individual error 
values for each state) are equivalent, as shown above. 
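
The equivalence of the two formulations can be checked numerically on a finite sequence. The sketch below assumes that the TD(0) errors are given as a fixed list (i.e., batch mode, with U not changing during the sequence) and that errors beyond the end of the list are zero; the function names and the parameter `gl`, standing for the product γλ, are hypothetical.

```python
def total_errors_backward(states, td0, gl):
    """Equation 7 on a finite sequence: each TD(0) error at step t is
    distributed over all states visited at steps k <= t."""
    totals = {x: 0.0 for x in states}
    for t, d in enumerate(td0):
        for k in range(t + 1):
            totals[states[k]] += d * gl ** (t - k)
    return totals


def total_errors_forward(states, td0, gl):
    """Equation 12 on a finite sequence: each visit at step t collects the
    discounted sum of the TD(0) errors from step t onwards (Equation 11)."""
    totals = {x: 0.0 for x in states}
    n = len(td0)
    for t in range(n):
        td_lambda = sum(gl ** k * td0[t + k] for k in range(n - t))
        totals[states[t]] += td_lambda
    return totals


# Hypothetical data: state "a" is visited twice.
states = ["a", "b", "a", "c"]
td0 = [0.5, -1.0, 2.0, 0.25]
print(total_errors_backward(states, td0, gl=0.72))
print(total_errors_forward(states, td0, gl=0.72))
```

Both functions return the same totals up to floating-point rounding, but the second one touches only the visited state at each step, which is the property exploited in Section 4.2.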

Having expressed TD(λ) in terms of $\Delta^\lambda_t$ errors, we can gain more insight into its operation 
and the role of λ. Some definitions will be helpful. Recall that the TD return for time t 
is defined as 

$$z_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

The m-step truncated TD return (Watkins, 1989; Barto et al., 1990) is obtained by taking 
into account only the first m terms of the above sum, i.e., 

$$z^{[m]}_t = \sum_{k=0}^{m-1} \gamma^k r_{t+k}.$$

Note, however, that the rejected terms $\gamma^m r_{t+m} + \gamma^{m+1} r_{t+m+1} + \ldots$ can be approximated by 
$\gamma^m U_{t+m-1}(x_{t+m})$. The corrected m-step truncated TD return (Watkins, 1989; Barto et al., 
1990) is thus: 

$$z^{(m)}_t = \sum_{k=0}^{m-1} \gamma^k r_{t+k} + \gamma^m U_{t+m-1}(x_{t+m}).$$
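
As a quick illustration of the last definition, the sketch below computes a corrected m-step truncated return from a list of rewards and a single predicted utility. The function name `corrected_truncated_return` and the argument `u_tail`, standing for the predicted utility of the state reached after m steps, are assumptions made for this example.

```python
def corrected_truncated_return(rewards, u_tail, gamma, m):
    """Corrected m-step truncated TD return.

    rewards : rewards r_t, r_{t+1}, ... (at least m of them)
    u_tail  : predicted utility of the state reached after m steps
    """
    # Uncorrected part: the first m discounted rewards.
    partial = sum(gamma ** k * rewards[k] for k in range(m))
    # Correction: the rejected tail is approximated by the discounted prediction.
    return partial + gamma ** m * u_tail


# Hypothetical example: two observed rewards, then a predicted utility of 10.
print(corrected_truncated_return([1.0, 1.0, 1.0], u_tail=10.0, gamma=0.5, m=2))
```

For m = 1 this is the familiar one-step target, the immediate reward plus the discounted predicted utility of the successor state.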

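Equation 11 may be rewritten in the following form: 

$$\begin{aligned} \Delta^\lambda_t &= \sum_{k=0}^{\infty} (\gamma\lambda)^k [r_{t+k} + \gamma(1-\lambda)U_{t+k}(x_{t+k+1}) + \gamma\lambda U_{t+k}(x_{t+k+1}) - U_{t+k}(x_{t+k})] \\ &= \sum_{k=0}^{\infty} (\gamma\lambda)^k [r_{t+k} + \gamma(1-\lambda)U_{t+k}(x_{t+k+1})] - U_t(x_t) \\ &\quad + \sum_{k=1}^{\infty} (\gamma\lambda)^k [U_{t+k-1}(x_{t+k}) - U_{t+k}(x_{t+k})]. \end{aligned} \tag{16}$$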
Note that for λ = 1 it yields: 

$$\begin{aligned} \Delta^1_t &= \sum_{k=0}^{\infty} \gamma^k r_{t+k} - U_t(x_t) + \sum_{k=1}^{\infty} \gamma^k [U_{t+k-1}(x_{t+k}) - U_{t+k}(x_{t+k})] \\ &= z_t - U_t(x_t) + \sum_{k=1}^{\infty} \gamma^k [U_{t+k-1}(x_{t+k}) - U_{t+k}(x_{t+k})]. \end{aligned}$$

If we relax for a moment our assumption about on-line learning mode and leave out time 
subscripts from U values, the last term disappears and we simply have: 

$$\Delta^1_t = z_t - U(x_t).$$

Similarly for general λ, if we define the TD(λ) return (Watkins, 1989) for time t as a 
weighted average of corrected truncated TD returns: 

$$z^\lambda_t = (1 - \lambda) \sum_{k=0}^{\infty} \lambda^k z^{(k+1)}_t = \sum_{k=0}^{\infty} (\gamma\lambda)^k [r_{t+k} + \gamma(1 - \lambda)U_{t+k}(x_{t+k+1})] \tag{17}$$

and again omit time subscripts, we obtain: 

$$\Delta^\lambda_t = z^\lambda_t - U(x_t). \tag{18}$$

The last equation sheds more light on the exact nature of the computation performed 
by TD(λ). The error at time step t is the difference between the TD(λ) return for that step 
and the predicted utility of the current state, that is, learning with that error value will 
bring the predicted utility closer to the return. For λ = 1 the quantity $z^1_t$ is the usual TD 
return for time t, i.e., the discounted sum of all future reinforcement values.⁴ For λ < 1 the 
term $r_{t+k}$ is replaced by $r_{t+k} + \gamma(1-\lambda)U_{t+k}(x_{t+k+1})$, that is, the actual immediate reward 
is augmented with the predicted future reward. 

The definition of the TD(λ) return (Equation 17) may be written recursively as 

$$z^\lambda_t = r_t + \gamma(\lambda z^\lambda_{t+1} + (1 - \lambda)U_t(x_{t+1})). \tag{19}$$

This probably best explains the role of λ in TD(λ) learning. It determines how the return 
used for improving predictions is obtained. When λ = 1, it is exactly the actual observed 
return, the discounted sum of all rewards. For λ = 0 it is the 1-step corrected truncated 
return, i.e., the sum of the immediate reward and the discounted predicted utility of the 
successor state. Using 0 < λ < 1 allows one to interpolate smoothly between these two extremes, 
relying partially on actual returns and partially on predictions. 
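
The recursive form of Equation 19 suggests computing TD(λ) returns backwards over a recorded sequence. The sketch below does this for a finite episode under two simplifying assumptions that are not in the text: predictions are held fixed (batch mode), and the return after the last recorded step is taken to be 0. The names `lambda_returns` and `u_next` are hypothetical.

```python
def lambda_returns(rewards, u_next, gamma, lam):
    """TD(lambda) returns for a finite episode via the recursion of Equation 19.

    rewards[t] : reward r_t
    u_next[t]  : predicted utility of the successor state x_{t+1}
                 (use 0.0 when the successor is terminal)
    """
    z = [0.0] * len(rewards)
    z_following = 0.0  # assumed return after the end of the episode
    for t in reversed(range(len(rewards))):
        z[t] = rewards[t] + gamma * (lam * z_following + (1.0 - lam) * u_next[t])
        z_following = z[t]
    return z


# Hypothetical episode of three steps ending in a terminal state.
rewards = [1.0, 2.0, 3.0]
u_next = [5.0, 6.0, 0.0]
print(lambda_returns(rewards, u_next, gamma=0.5, lam=1.0))  # actual discounted returns
print(lambda_returns(rewards, u_next, gamma=0.5, lam=0.0))  # one-step corrected returns
```

Intermediate values of `lam` blend the two lists, which is the interpolation described above.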

Equation 18 holds true only for batch learning mode, but in fact TD methods were 
originally formulated for batch learning. The incremental version, more practically useful, 

4. This observation corresponds to the equivalence of "generic" TD(λ) for λ = 1 to supervised learning 
shown by Sutton (1988). To obtain such a result it was necessary to discount prediction differences with 
γλ instead of λ alone in Equation 6, though Sutton, presenting the RL-oriented form of TD, did not make 
this modification. 
introduces an additional term. Let $D^\lambda_t$ designate that term. By comparing Equations 16 
and 17 we get: 

$$D^\lambda_t = \Delta^\lambda_t - (z^\lambda_t - U_t(x_t)) = \sum_{k=1}^{\infty} (\gamma\lambda)^k [U_{t+k-1}(x_{t+k}) - U_{t+k}(x_{t+k})]. \tag{20}$$

The magnitude of this discrepancy term, and consequently its influence on the learning 
process, obviously depends on the learning rate value. To examine it further, suppose a 
learning rate η is used when learning U on the basis of $\Delta^\lambda_t$ errors. Let the corresponding 
learning rule be: 

$$U_{t+1}(x_t) := U_t(x_t) + \eta \Delta^\lambda_t.$$

Then we have 

$$\begin{aligned} U_{t+1}(x_t) - U_t(x_t) &= \eta (z^\lambda_t - U_t(x_t)) + \eta D^\lambda_t \\ &= \eta (z^\lambda_t - U_t(x_t)) + \eta \sum_{k=1}^{\infty} (\gamma\lambda)^k [U_{t+k-1}(x_{t+k}) - U_{t+k}(x_{t+k})] \\ &\le \eta (z^\lambda_t - U_t(x_t)) - \eta^2 \sum_{k=1}^{\infty} (\gamma\lambda)^k \Delta^\lambda_{t+k-1}, \end{aligned} \tag{21}$$

with equality if and only if $x_{t+k} = x_{t+k-1}$ for all k. A similar result may be obtained for the 
eligibility traces implementation, with learning driven by $\Delta_x(t)$ errors defined by Equation 9. 
We would then have: 

$$U_{t+1}(x_t) - U_t(x_t) = \eta (z^\lambda_t - U_t(x_t)) - \eta^2 \sum_{k=1}^{\infty} (\gamma\lambda)^k \Delta^0_{t+k-1}\, e_{x_{t+k}}(t+k-1). \tag{22}$$

This effect may be considered another drawback of the eligibility traces implementation of
TD(λ), apart from its inefficiency and lack of generality. Though for small learning rates
the effect of $D_t^\lambda$ is negligible, it may still be harmful in some cases, especially for large γ
and λ.⁵

4.2 The TTD Procedure 

We have shown that TD errors $\Delta_t^\lambda$ or $z_t^\lambda - U_t(x_t)$ can be used almost equivalently for TD(λ)
learning, yielding the same overall results as the eligibility traces implementation, which has,
however, important drawbacks in practice. Nevertheless, it is impossible to use either TD(λ)
errors $\Delta_t^\lambda$ or TD(λ) returns $z_t^\lambda$ for on-line learning, since they are not available. At step t
the knowledge of both $r_{t+k}$ and $x_{t+k}$ is required for all k = 1, 2, ..., and there is no way to
implement this in practice. Recall, however, the definition of the truncated TD return. Why
not define the truncated TD(λ) error and the truncated TD(λ) return? The appropriate
definitions are:

$$\Delta_t^{\lambda,m} = \sum_{k=0}^{m-1} (\gamma\lambda)^k \Delta_{t+k}^0 \tag{23}$$

and

$$\begin{aligned}
z_t^{\lambda,m} &= \sum_{k=0}^{m-2} (\gamma\lambda)^k \bigl[r_{t+k} + \gamma(1-\lambda)U_{t+k}(x_{t+k+1})\bigr] + (\gamma\lambda)^{m-1}\bigl[r_{t+m-1} + \gamma U_{t+m-1}(x_{t+m})\bigr] \\
&= \sum_{k=0}^{m-1} (\gamma\lambda)^k \bigl[r_{t+k} + \gamma(1-\lambda)U_{t+k}(x_{t+k+1})\bigr] + (\gamma\lambda)^m U_{t+m-1}(x_{t+m}).
\end{aligned} \tag{24}$$

5. Sutton (1984) presented the technique of eligibility traces as an implementation of the recency and
frequency heuristics. In this context, the phenomenon examined above may be considered a harmful
effect of the frequency heuristic. Sutton discussed an example finite-state task where this heuristic might
be misleading (Sutton, 1984, page 171).

We call $\Delta_t^{\lambda,m}$ the m-step truncated TD(λ) error, or simply the TTD(λ, m) error at time
step t, and $z_t^{\lambda,m}$ the m-step truncated TD(λ) return, or the TTD(λ, m) return for time t.
Note that $z_t^{\lambda,m}$ defined by Equation 24 is corrected, i.e., it is not obtained by simply
truncating Equation 17. The correction term $(\gamma\lambda)^m U_{t+m-1}(x_{t+m})$ results in multiplying the
last prediction $U_{t+m-1}(x_{t+m})$ by γ alone instead of γ(1 − λ), which is virtually equivalent
to using λ = 0 for that step. It is done in order to include in $z_t^{\lambda,m}$ all the available
information about the expected returns for further time steps (t + m, t + m + 1, ...) contained
in $U_{t+m-1}(x_{t+m})$. Without this correction, for large λ this information would be almost
completely lost.

The m-step truncated TD(λ) errors or returns defined in this way can be used for on-line
learning by keeping track of the last m visited states, and updating at each step the predicted
utility of the least recent of those m states. This idea leads to what we call the TTD
Procedure (Truncated Temporal Differences), which can be a good approximation of TD(λ)
for sufficiently large m. The procedure is parameterized by λ and m values. An m-element
experience buffer is maintained, containing records $\langle x_{t-k}, a_{t-k}, r_{t-k}, U_{t-k}(x_{t-k+1})\rangle$ for all
k = 0, 1, ..., m − 1, where t is the current time step. At each step t, by writing $x_{[k]}$, $a_{[k]}$,
$r_{[k]}$, and $u_{[k]}$ we refer to the corresponding elements of the buffer, storing $x_{t-k}$, $a_{t-k}$, $r_{t-k}$,
and $U_{t-k}(x_{t-k+1})$.⁶ References to U are not subscripted with time steps, since all of them
concern the values available at the current time step; in a practical implementation this
directly corresponds to restoring a function value from some function approximator or a
look-up table. Under this notational convention, the operation of the TTD(λ, m) procedure
is presented in Figure 1. It uses TTD(λ, m) returns for learning. An alternative version, using
TTD(λ, m) errors instead (based on Equation 11), is also possible and straightforward to
formulate, but there is no reason to use a "weaker" version (subject to the harmful effects
described by Equations 20 and 21) when a "stronger" one is available at the same cost.

At the beginning of learning, before the first m steps are made, no learning can take
place. During these initial steps the operation of the TTD procedure reduces to updating
appropriately the contents of the experience buffer. This obvious technical detail was left
out in Figure 1 for the sake of simplicity.

The TTD(λ, m) return value z is computed in step 5 by the repeated application of
Equation 19. The computational cost of propagating the return in time in this way is
acceptable in practice for reasonable values of m. For some function representation methods,
such as neural networks, the overall time complexity is dominated by the costs of retrieving a
function value and learning performed in steps 4 and 6, and the cost of computing z is
negligible. One advantage of such an implementation is that it allows the use of adaptive λ
values: in step 5 one can use a step-dependent λ value, depending on whether the corresponding
action was or was not a non-policy action, or "how much" non-policy it was. This refinement
to the TD(λ) algorithm was suggested by Watkins (1989) and, more recently, by Sutton and
Singh (1994). Later we will see how the TTD return computation can be performed in a fully
incremental way, using constant time at each step for arbitrary m.

6. This naturally means that the buffer's indices are shifted appropriately on each time tick.

At each time step t:

1. observe current state $x_t$; $x_{[0]} := x_t$;

2. select an action $a_t$ for state $x_t$; $a_{[0]} := a_t$;

3. perform action $a_t$; observe new state $x_{t+1}$ and immediate reinforcement $r_t$;

4. $r_{[0]} := r_t$; $u_{[0]} := U(x_{t+1})$;

5. for k = 0, 1, ..., m − 1 do

   if k = 0 then $z := r_{[k]} + \gamma u_{[k]}$

   else $z := r_{[k]} + \gamma(\lambda z + (1-\lambda)u_{[k]})$;

6. $\mathrm{update}^\eta(U, x_{[m-1]}, a_{[m-1]}, z - U(x_{[m-1]}))$;

7. shift the indices of the experience buffer.

Figure 1: The TTD(λ, m) procedure.
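The steps of Figure 1 can be sketched in Python for a tabular utility function as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the class name `TabularTTD`, the dictionary-based table, and the plain update rule U(x) := U(x) + η(z − U(x)) are choices made for the example, and the warm-up check covers the initial steps that Figure 1 leaves out.

```python
from collections import defaultdict, deque


def ttd_return(records, gamma, lam):
    """Step 5 of Figure 1: records[0] is the most recent (x, a, r, u) record."""
    z = 0.0
    for k, (_, _, r, u) in enumerate(records):
        if k == 0:
            z = r + gamma * u
        else:
            z = r + gamma * (lam * z + (1.0 - lam) * u)
    return z


class TabularTTD:
    """Hypothetical tabular learner following the steps of Figure 1."""

    def __init__(self, gamma, lam, m, eta):
        self.gamma, self.lam, self.m, self.eta = gamma, lam, m, eta
        self.U = defaultdict(float)    # utilities start at 0
        self.buffer = deque(maxlen=m)  # index 0 holds the most recent record

    def step(self, x, a, r, x_next):
        # Steps 1-4: store the new experience with the current U(x_next).
        # Step 7 (index shift) is implicit: appendleft drops the oldest record.
        self.buffer.appendleft((x, a, r, self.U[x_next]))
        if len(self.buffer) < self.m:
            return  # warm-up: fewer than m steps made, no learning yet
        z = ttd_return(self.buffer, self.gamma, self.lam)   # step 5
        x_old = self.buffer[-1][0]
        self.U[x_old] += self.eta * (z - self.U[x_old])     # step 6


if __name__ == "__main__":
    learner = TabularTTD(gamma=0.5, lam=0.9, m=2, eta=0.5)
    learner.step("a", 0, 0.0, "b")
    learner.step("b", 0, 1.0, "c")
    print(dict(learner.U))
```

Note that the error in step 6 is taken against the current table entry for the oldest state, in line with the remark that references to U are not subscripted with time steps.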

Note that the function update carried out in step 6 at time t applies to the state and
action from time t − m + 1, i.e., m − 1 time steps earlier. This delay between an experience
event and learning might be considered a potential weakness of the presented approach,
especially for large m. Note, however, that as a baseline in computing the error value the
current utility $U(x_{[m-1]}) = U_t(x_{t-m+1})$ is used. This is an important point, because it
guarantees that learning will have the desired effect of moving the utility (whatever value
it currently has) towards the corresponding TTD return. If the error used in step 6 were
$z - U_{t-m}(x_{t-m+1})$ instead of $z - U_t(x_{t-m+1})$, then applying it to learning at time t would
be problematic. Anyway, it seems that m should not be too large.

The TTD procedure is not an exact implementation of TD methods for two reasons.
First, it only approximates TD(λ) returns with TTD(λ, m) returns. Second, it introduces
the aforementioned delay between experience and learning. I believe, however, that it is
possible to give strict conditions under which the convergence properties of TD(λ) hold
true for the TTD implementation.

4.2.1 Choice of m 

The reasonable choice of m obviously depends on λ. For λ = 0 the best possible is m = 1,
and for λ = 1 and γ = 1 no finite value of m is large enough to accurately approximate
TD(λ). Fortunately, this does not seem to be very painful. It is rather unlikely that in any
application one would want to use the combination of λ = 1 and γ = 1, the more so as
previous empirical results with TD(λ) indicate that λ = 1 is usually not the optimal value
to use, and it is at best comparable with other, smaller values (Sutton, 1984; Tesauro, 1992;
Lin, 1993). Similar conclusions follow from the discussion of the choice of λ presented by
Watkins (1989) or Lin (1993). For λ < 1 or γ < 1 we would probably like to have such a
value of m that the discount $(\gamma\lambda)^m$ is a small number. One possible definition of 'small'
here could be, e.g., 'much less than γλ'. This is obviously a completely informal criterion.
Table 1 illustrates the practical effects of this heuristic. On the other hand, for too large m,
the delay between experience and learning introduced by the TTD procedure might become
significant and cause some problems. Some of the experiments described in Section 5 have
been designed in order to test different values of m for fixed 0 < λ < 1.



γλ                              0.99   0.975   0.95   0.9   0.8   0.6
min{m | (γλ)^m < (1/10) γλ}     231    92      46     23    12    6

Table 1: Choosing m: an illustration.
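The entries of Table 1 can be reproduced with a few lines of Python. The sketch below assumes that 'much less than γλ' is read as 'less than one tenth of γλ', which is the reading consistent with the tabulated values; the function name `min_truncation_length` and its `ratio` parameter are introduced here for illustration only.

```python
def min_truncation_length(gl, ratio=0.1):
    """Smallest m with gl**m < ratio * gl, for a product gl = gamma * lambda in (0, 1)."""
    if not 0.0 < gl < 1.0:
        raise ValueError("gamma * lambda must lie strictly between 0 and 1")
    m = 1
    while gl ** m >= ratio * gl:
        m += 1
    return m


if __name__ == "__main__":
    for gl in (0.99, 0.975, 0.95, 0.9, 0.8, 0.6):
        print(gl, min_truncation_length(gl))
```

For γλ = 0.855 (the value used in most experiments of Section 5) the same function gives m = 16, so m = 25 satisfies this informal criterion with some margin.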



4.2.2 Reset Operation 

Until now, we have assumed that the learning process, once started, continues infinitely
long. This is not true for episodic tasks (Sutton, 1984) and for many real-world tasks,
where learning must usually stop at some time. This makes it necessary to design a
special mechanism for the TTD procedure, which will be called the reset operation. The reset
operation would be invoked after the end of each episode in episodic tasks, or after the
overall end of learning.

There is not very much to be done. The only problem that must be dealt with is that the
experience buffer contains the records of the last m steps for which learning has not taken
place yet, and there will be no further steps that would make learning for these remaining
steps possible. The implementation of the reset operation that we find the most natural
and coherent with the TTD procedure is then to simulate m additional fictitious steps, so
that learning takes place for all the real steps left in the buffer, and their TTD returns
remain unaffected by the simulated fictitious steps. The corresponding algorithm, presented
in Figure 2, is formulated as a replacement of the original algorithm from Figure 1 for the
final time step. At the final step, when there is no successor state, the fictitious successor
state utility is assumed to be 0. This corresponds to assigning 0 to $u_{[0]}$. The actual reset
operation is performed in step 5.
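The effect of the reset operation can be sketched in Python as follows. Instead of literally simulating fictitious steps, this hedged sketch repeatedly computes the return over the real records still in the buffer, updates the oldest one, and discards it; the outcome should match Figure 2, where the inner loop starts at $k_0$ and so skips the fictitious records. The function names, the record layout, and the dictionary table are assumptions of the example.

```python
from collections import deque


def ttd_return(records, gamma, lam):
    """Return for the oldest record; records[0] is the most recent (x, a, r, u)."""
    z = 0.0
    for k, (_, _, r, u) in enumerate(records):
        if k == 0:
            z = r + gamma * u
        else:
            z = r + gamma * (lam * z + (1.0 - lam) * u)
    return z


def reset(buffer, U, gamma, lam, eta):
    """Flush the experience buffer at the end of an episode.

    Assumption: the final step has already been stored at buffer[0] with a
    successor utility of 0. Each pass learns for the oldest remaining record.
    """
    while buffer:
        z = ttd_return(buffer, gamma, lam)
        x_old = buffer[-1][0]
        old_value = U.get(x_old, 0.0)
        U[x_old] = old_value + eta * (z - old_value)
        buffer.pop()  # the oldest record has now been learned from


if __name__ == "__main__":
    buf = deque([("c", 0, 1.0, 0.0), ("b", 0, 0.0, 0.0), ("a", 0, 0.0, 0.0)])
    table = {}
    reset(buf, table, gamma=0.5, lam=1.0, eta=1.0)
    print(table)
```

Because the loop simply runs until the buffer is empty, the same routine also handles episodes shorter than m steps.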

4.2.3 Incremental TTD 

At the final time step t:

1. observe current state $x_t$; $x_{[0]} := x_t$;

2. select an action $a_t$ for state $x_t$; $a_{[0]} := a_t$;

3. perform action $a_t$; observe immediate reinforcement $r_t$;

4. $r_{[0]} := r_t$; $u_{[0]} := 0$;

5. for $k_0$ = 0, 1, ..., m − 1 do

   (a) for k = $k_0$, $k_0$ + 1, ..., m − 1 do

       if k = $k_0$ then $z := r_{[k]} + \gamma u_{[k]}$
       else $z := r_{[k]} + \gamma(\lambda z + (1-\lambda)u_{[k]})$;

   (b) $\mathrm{update}^\eta(U, x_{[m-1]}, a_{[m-1]}, z - U(x_{[m-1]}))$;

   (c) shift the indices of the experience buffer.

Figure 2: The reset operation for the TTD(λ, m) procedure.

As stated above, the cost of iteratively computing the TTD(λ, m) return is relatively small
for reasonable m, and with some function representation methods, for which restoring and
updating function values is computationally expensive, may be really negligible. We also
argued that reasonable values of m should not be too large. Moreover, such iterative
return computation is easy to understand and reflects well the idea of TTD. That is why
we presented the TTD procedure in that form. It is possible, however, to compute the
TTD(λ, m) return in a fully incremental manner, using constant time for arbitrary m.

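To see this, note that the definition of the TTD(λ, m) return (Equation 24) may be
rewritten in the following form:

$$\begin{aligned}
z_t^{\lambda,m} &= \sum_{k=0}^{m-1} (\gamma\lambda)^k r_{t+k} + \sum_{k=0}^{m-2} (\gamma\lambda)^k \gamma(1-\lambda)U_{t+k}(x_{t+k+1}) + (\gamma\lambda)^{m-1}\gamma U_{t+m-1}(x_{t+m}) \\
&= S_t^{\lambda,m} + T_t^{\lambda,m} + W_t^{\lambda,m},
\end{aligned}$$

where

$$\begin{aligned}
S_t^{\lambda,m} &= \sum_{k=0}^{m-1} (\gamma\lambda)^k r_{t+k}, \\
T_t^{\lambda,m} &= \sum_{k=0}^{m-2} (\gamma\lambda)^k \gamma(1-\lambda)U_{t+k}(x_{t+k+1}), \\
W_t^{\lambda,m} &= (\gamma\lambda)^{m-1}\gamma U_{t+m-1}(x_{t+m}).
\end{aligned}$$

$W_t^{\lambda,m}$ can be directly computed in constant time for any m. It is not difficult to convince
oneself that:

$$S_{t+1}^{\lambda,m} = \frac{1}{\gamma\lambda}\bigl[S_t^{\lambda,m} - r_t + (\gamma\lambda)^m r_{t+m}\bigr], \tag{25}$$

$$T_{t+1}^{\lambda,m} = \frac{1}{\gamma\lambda}\bigl[T_t^{\lambda,m} - \gamma(1-\lambda)U_t(x_{t+1}) + (1-\lambda)W_t^{\lambda,m}\bigr]. \tag{26}$$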






The above two equations define the algorithm for computing incrementally $S_t^{\lambda,m}$ and $T_t^{\lambda,m}$,
and consequently computing $z_t^{\lambda,m}$ in constant time for arbitrary m, with a very small
computational expense. This algorithm is strictly mathematically equivalent to the algorithm
presented in Figure 1.⁷ Modifying the TTD procedure appropriately is straightforward and
will not be discussed. A drawback of this modification is that it probably does not allow
the learner to use different (adaptive) λ values at each step, i.e., it may not be possible to
combine it with the refinements suggested by Watkins (1989) or Sutton and Singh (1994).
Despite this, such an implementation might be beneficial if one wanted to use really large m.
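The recurrences of Equations 25 and 26 can be checked numerically with a short Python sketch. It works on recorded sequences, where `r[t]` stands for $r_t$ and `u[t]` for $U_t(x_{t+1})$; both function names and this list-based setting are assumptions made for the illustration, and γλ > 0 is required because of the division.

```python
def ttd_return_direct(r, u, t, gamma, lam, m):
    """TTD(lambda, m) return for time t, summed directly as S + T + W."""
    gl = gamma * lam
    s = sum(gl ** k * r[t + k] for k in range(m))
    tt = sum(gl ** k * gamma * (1.0 - lam) * u[t + k] for k in range(m - 1))
    w = gl ** (m - 1) * gamma * u[t + m - 1]
    return s + tt + w


def ttd_returns_incremental(r, u, gamma, lam, m):
    """All TTD(lambda, m) returns, with S and T maintained by Equations 25 and 26."""
    gl = gamma * lam
    n = len(r) - m + 1  # number of time steps with a full m-step window
    s = sum(gl ** k * r[k] for k in range(m))
    tt = sum(gl ** k * gamma * (1.0 - lam) * u[k] for k in range(m - 1))
    out = []
    for t in range(n):
        w = gl ** (m - 1) * gamma * u[t + m - 1]  # W is computed directly
        out.append(s + tt + w)
        if t + 1 < n:
            s = (s - r[t] + gl ** m * r[t + m]) / gl                  # Equation 25
            tt = (tt - gamma * (1.0 - lam) * u[t] + (1.0 - lam) * w) / gl  # Equation 26
    return out


if __name__ == "__main__":
    r = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5, 1.0, 2.0]
    u = [0.3, 1.2, -0.7, 2.0, 0.1, 0.9, 1.5, -0.4]
    print(ttd_returns_incremental(r, u, 0.95, 0.9, 3))
```

Each step divides by γλ, so rounding errors are amplified slightly over long runs; this is one way in which the incremental form may differ numerically from the iterative one even though the two are mathematically equivalent.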

4.2.4 TTD-Based Implementations of RL Algorithms 

To implement particular TD-based reinforcement learning algorithms on the basis of the
TTD procedure, one just has to substitute appropriate function values for U, and define
the updating operation of step 6 in Figure 1 and step 5b in Figure 2. Specifically, for the
three algorithms outlined in Section 2.2 one should:

• for AHC:

  1. replace $U(x_{t+1})$ with $V(x_{t+1})$ in step 4 (Figure 1);

  2. implement step 6 (Figure 1) and step 5b (Figure 2) as:

     $v := V(x_{[m-1]})$;
     $\mathrm{update}^\alpha(V, x_{[m-1]}, z - v)$;
     $\mathrm{update}^\beta(f, x_{[m-1]}, a_{[m-1]}, z - v)$;

• for Q-learning:

  1. replace $U(x_{t+1})$ with $\max_a Q(x_{t+1}, a)$ in step 4 (Figure 1);

  2. implement step 6 (Figure 1) and step 5b (Figure 2) as:

     $\mathrm{update}^\alpha(Q, x_{[m-1]}, a_{[m-1]}, z - Q(x_{[m-1]}, a_{[m-1]}))$;

• for advantage updating:

  1. replace $U(x_{t+1})$ with $V(x_{t+1})$ in step 4 (Figure 1);

  2. implement step 6 (Figure 1) and step 5b (Figure 2) as:

     $A^{\max} := \max_a A(x_{[m-1]}, a)$;
     $\mathrm{update}^\alpha(A, x_{[m-1]}, a_{[m-1]}, A^{\max} - A(x_{[m-1]}, a_{[m-1]}) + z - V(x_{[m-1]}))$;
     $\mathrm{update}^\beta(V, x_{[m-1]}, \frac{1}{\alpha}[\max_a A(x_{[m-1]}, a) - A^{\max}])$.
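As an illustration of the Q-learning variant listed above, the following Python sketch plugs $\max_a Q(x_{t+1}, a)$ into step 4 and updates the Q-value of the oldest buffered state-action pair in step 6. The class name `TTDQLearner`, the tabular dictionary, and the simple additive update are assumptions made for this example; action selection is left out.

```python
from collections import defaultdict, deque


class TTDQLearner:
    """Hypothetical tabular Q-learning on top of the TTD procedure."""

    def __init__(self, actions, gamma, lam, m, alpha):
        self.actions = list(actions)
        self.gamma, self.lam, self.m, self.alpha = gamma, lam, m, alpha
        self.Q = defaultdict(float)    # keyed by (state, action)
        self.buffer = deque(maxlen=m)  # index 0 holds the most recent record

    def step(self, x, a, r, x_next):
        # Step 4: the stored successor utility is max_a Q(x_next, a).
        u = max(self.Q[(x_next, b)] for b in self.actions)
        self.buffer.appendleft((x, a, r, u))
        if len(self.buffer) < self.m:
            return  # warm-up: no learning before m steps are buffered
        # Step 5: propagate the return from the most recent record backwards.
        z = 0.0
        for k, (_, _, rk, uk) in enumerate(self.buffer):
            if k == 0:
                z = rk + self.gamma * uk
            else:
                z = rk + self.gamma * (self.lam * z + (1.0 - self.lam) * uk)
        # Step 6: update Q for the least recent state-action pair.
        key = (self.buffer[-1][0], self.buffer[-1][1])
        self.Q[key] += self.alpha * (z - self.Q[key])


if __name__ == "__main__":
    learner = TTDQLearner(actions=(0, 1), gamma=0.5, lam=0.9, m=1, alpha=0.5)
    learner.step("s", 0, 1.0, "t")
    learner.step("r", 1, 0.0, "s")
    print(learner.Q[("s", 0)], learner.Q[("r", 1)])
```

With m = 1 the sketch reduces to ordinary one-step Q-learning, which gives a simple sanity check.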

4.3 Related Work 

The simple idea of truncating temporal differences that is implemented by the TTD
procedure is not new. It was probably first suggested by Watkins (1989). This paper owes much
to his work. But, to the best of my knowledge, this idea has never been explicitly and
exactly specified, implemented, and tested. In this sense the TTD procedure is an original
development.

7. But it is not necessarily numerically equivalent, which may sometimes cause problems in practical
implementations.

Lin (1993) used a very similar implementation of TD(λ), but only for what he called
experience replay, and not for actual on-line reinforcement learning. In his approach a
sequence of past experiences is replayed occasionally, and during replay for each experience
the TD(λ) return (truncated to the length of the replayed sequence) is computed by
applying Equation 19, and a corresponding function update is performed. Such a learning
method is in some respects more computationally expensive than the TTD procedure
(especially when implemented in a fully incremental manner, as suggested above), since it
requires updating predictions sequentially for all replayed experiences, besides "regular"
TD(0) updates performed at each step (while TTD always requires only one update per time
step), and it does not allow the learner to take full advantage of TD(λ > 0), which is applied
only occasionally.

Peng and Williams (1994) presented an alternative way of combining Q-learning and
TD(λ), different from that discussed in Section 2.2. Their motivation was to better estimate
TD returns by the use of TD errors. Toward that end, they used the standard Q-learning error

$$r_t + \gamma \max_a Q_t(x_{t+1}, a) - Q_t(x_t, a_t)$$

for one-step updates and a modified error

$$r_t + \gamma \max_a Q_t(x_{t+1}, a) - \max_a Q_t(x_t, a),$$

propagated using eligibility traces, thereafter. The TTD procedure achieves a similar
objective in a more straightforward way, by the use of truncated TD(λ) returns.

Other related work is that of Pendrith (1994). He applied the idea of eligibility traces in
a non-standard way to estimate TD returns. His approach is more computationally efficient
than the classical eligibility traces technique (it requires one prediction update per time
step) and is free of the potentially harmful effect described by Equation 22. The method
seems to be roughly equivalent to the TTD procedure with λ = 1 and large m, though it is
probably much more complex to implement.

5. Demonstrations 

The demonstrations presented in this section use the AHC variant of the TTD procedure.
The reason is that the AHC algorithm is the simplest of the three described algorithms and
its update rule for the evaluation function most directly corresponds to TD(λ). Future work
will investigate the TTD procedure for the two other algorithms.

A tabular representation of the evaluation and policy functions is used. The abstract
function update operation described by Equation 2 is implemented in a standard way as

$$\varphi(p_0, p_1, \ldots, p_{n-1}) := \varphi(p_0, p_1, \ldots, p_{n-1}) + \eta\Delta. \tag{27}$$

Actions to execute at each step are selected using a simple stochastic selection
mechanism based on a Boltzmann distribution. According to this mechanism, action $a^*$ is
selected in state x with probability

$$\mathrm{Prob}(x, a^*) = \frac{\exp\bigl(f(x, a^*)/T\bigr)}{\sum_a \exp\bigl(f(x, a)/T\bigr)}, \tag{28}$$

where the temperature T > 0 adjusts the amount of randomness.
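A Boltzmann selection mechanism of this kind can be sketched in Python as follows. The function names and the list of per-action policy values are assumptions of the example; subtracting the largest value before exponentiating does not change the probabilities but avoids overflow for very small temperatures, such as those used later in the experiments.

```python
import math
import random


def boltzmann_probabilities(prefs, temperature):
    """Selection probabilities proportional to exp(pref / T), computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(prefs)
    # Shifting by the maximum leaves the ratios unchanged and prevents overflow.
    weights = [math.exp((p - top) / temperature) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(prefs, temperature, rng=random):
    """Draw an action index according to the Boltzmann distribution."""
    probs = boltzmann_probabilities(prefs, temperature)
    return rng.choices(range(len(prefs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    print(boltzmann_probabilities([0.2, 0.1, 0.0], temperature=0.02))
    print(select_action([0.2, 0.1, 0.0], temperature=0.02, rng=random.Random(0)))
```

Lower temperatures concentrate the probability mass on the action with the largest policy value, so the selection becomes nearly deterministic.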



5.1 The Car Parking Problem 

This section presents experimental results for a learning control problem with a relatively 
large state space and hard temporal credit assignment. We call this problem the car parking 
problem, though it does not attempt to simulate any real-world problem at all. Using words 
such as 'car', 'garage', or 'parking' is just a convention that simplifies problem description 
and the interpretation of results. The primary purpose of the experiments is neither just 
to solve the problem nor to provide evidence of the usefulness of the tested algorithm 
for any particular practical problem. We use this example problem in order to illustrate
the performance of the AHC algorithm implemented within the TTD framework and to
empirically evaluate the effects of different values of the TTD parameters λ and m.

The car parking problem is illustrated in Figure 3. A car, represented as a rectangle, 
is initially located somewhere inside a bounded area, called the driving area. A garage is 
a rectangular area of a size somewhat larger than the car. All important dimensions and 
distances are shown in the figure. The agent — the driver of the car — is required to park 
it in the garage, so that the car is entirely inside. The task is episodic, though it is neither 
a time-until-success nor time-until-failure task (in Sutton's (1984) terminology), but rather 
a combination of both. Each episode finishes either when the car enters the garage or when 
it hits a wall (of the garage or of the driving area). After an episode the car is reset to its 
initial position. 



5.1.1 State Representation 

The state representation consists of three variables: the rectangular coordinates of the center
of the car, x and y, and the angle θ between the car's axis and the x axis of the coordinate
system. The orientation of the system is shown in the figure. The initial location and
orientation of the car are fixed and described by x = 6.15 m, y = 10.47 m, and θ = 3.7 rad.
They were chosen so as to make the task neither too easy nor too difficult.



5.1.2 Action Representation 

The admissible actions are 'drive straight on', 'turn left', and 'turn right'. The action of
driving straight on has the effect of moving the car forward along its axis, i.e., without
changing θ. The actions of turning left and right are equivalent to moving along an arc with
a fixed radius. The distance of each move is determined by a constant car velocity v and
simulation time step τ. Exact motion equations and other details are given in Appendix A.



5.1.3 Reinforcement Mechanism 

The design of the reinforcement function is fairly straightforward. The agent receives a
reinforcement value of 1 (a reward) whenever it successfully parks the car in the garage,
and a reinforcement value of −1 (a punishment) whenever it hits a wall. At all other time
steps the reinforcement is 0. That is, non-zero reinforcements are received only at the last
step of each episode. This involves a relatively hard temporal credit assignment problem,
providing a good experimental framework for testing the efficiency of the TTD procedure.
The problem is hard not only because of reinforcement delay, but also because punishments
are much more frequent than rewards: it is much easier to hit a wall than to park the car
correctly.

With such a reinforcement mechanism as presented above, an optimal policy for any
0 < γ < 1 is a policy that parks the car in the garage in the smallest possible
number of steps.

Figure 3: The car parking problem. The scale of all dimensions is preserved: w = 2 m,
l = 4 m, $x_0$ = −1.5 m, $x_G$ = 1.5 m, $x_1$ = 8.5 m, $y_0$ = −3 m, $y_G$ = 3 m, $y_1$ = 13 m.

5.1.4 Function Representation

The car parking problem has a continuous state space. It is artificially discretized — divided 
into a finite number of disjoint regions by quantizing the three state variables, and then a 
function value for each region is stored in a look-up table. The quantization thresholds are: 

• for x: -0.5, 0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 6.0 m, 

• for y: 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0 m, 

• for θ: $\frac{19}{20}\pi$, $\pi$, $\frac{21}{20}\pi$, ..., $\frac{29}{20}\pi$, $\frac{3}{2}\pi$, $\frac{31}{20}\pi$ rad.

This yields 9 × 10 × 14 = 1260 regions. Of course many of them will never be visited. The
threshold values were chosen so as to make the resulting discrete state space of a moderate
size. The quantization is dense near the garage, and becomes more sparse as the distance
from the garage increases.



5.1.5 Experimental Design and Results 

Our experiments with applying the TTD procedure to the car parking problem are divided
into two studies, testing the effects of the two TTD parameters λ and m. The parameter
settings for all experiments are presented in Table 2. The symbols α and β are used to
designate the learning rates for the evaluation and policy functions, respectively. The initial
values of the functions were all set to 0, since we assumed that no knowledge is available
about expected reinforcement levels.



Study     TTD Parameters     Learning Rates
Number    λ       m          α       β
1         0       25         0.7     0.7
1         0.3     25         0.5     0.5
1         0.5     25         0.5     0.5
1         0.7     25         0.5     0.5
1         0.8     25         0.5     0.5
1         0.9     25         0.25    0.25
1         1       25         0.25    0.25
2         0.9     5          0.25    0.25
2         0.9     10         0.25    0.25
2         0.9     15         0.25    0.25
2         0.9     20         0.25    0.25

Table 2: Parameter settings for the experiments with the car parking problem.

As stated above, the experiments were designed to test the effects of the two TTD
parameters. The other parameters were assigned values according to the following principles:

• the discount factor γ was fixed and equal to 0.95 in all experiments,

• the temperature value was also fixed and set to 0.02, which seemed to be equally good
for all experiments,

• the learning rates α and β were roughly optimized in each experiment.⁸

Each experiment continued for 250 episodes, the number selected so as to allow all or
almost all runs of all experiments to converge. The results presented for all experiments
are averaged over 25 individual runs, each differing only in the initial seed of the random
number generator. This number was chosen as a reasonable compromise between the
reliability of results and computational costs. The results are presented as plots of the average
reinforcement value per time step for the previous 5 consecutive episodes versus the episode
number.

Study 1: Effects of λ. The objective of this study was to examine the effects of various
λ values on learning speed and quality, with m set to 25. The value m = 25 was found to be
large enough for all the tested λ values (perhaps except λ = 1).⁹ Smaller m values might be
used for small λ (in particular, m = 1 for λ = 0), but it was kept constant for consistency.



[Plot: two panels, Reinf/Step (vertical axis) versus Episode (horizontal axis, 50 to 250).]

Figure 4: The car parking problem, learning curves for study 1.

The learning curves for this study are presented in Figure 4. The observations can be
briefly summarized as follows:

• λ = 0 gives the worst performance of all (not all of the 25 runs managed to converge
within 250 episodes),

• increasing λ improves learning speed,

• λ values above or equal to 0.7 are all similarly effective, greatly outperforming λ = 0 and
clearly better than λ = 0.5,

• using large λ made it necessary to reduce the learning rates (cf. Table 2) to ensure
convergence.

8. The optimization procedure in most cases was as follows: some rather large value was tested in a few
runs; if it did not give any effects of overtraining and premature convergence, it was accepted; otherwise
a (usually twice) smaller value was tried, etc.

9. Note that for λ = 0.9, m = 25, and γ = 0.95 we have $(\gamma\lambda)^m \approx 0.02 < 0.855 = \gamma\lambda$.

The main result is that using large λ with the TTD procedure (including 1) always
significantly improved performance. This is not quite consistent with the empirical results of
Sutton (1988), who found the performance of TD(λ) the best for intermediate λ, and the
worst for λ = 1. Lin (1993), who used λ > 0 for his experience replay experiments, reported
λ close to 1 as the most successful, as in this work. He speculated that the difference
between his results and Sutton's might have been caused by switching occasionally (for
non-policy actions) to λ = 0 in his studies.¹⁰ Our results, obtained for λ held fixed all the
time,¹¹ suggest that this is not a good explanation. It seems more likely that the optimal λ
value simply depends strongly on the particular problem. Another point is that neither our
TTD(1, 25) nor Lin's implementation is exactly equivalent to TD(1).

Study 2: Effects of m. This study was designed to investigate the effects of using several
different m values for a fixed and relatively large λ value. The (approximately) best λ from
study 1 was used, that is 0.9. The smallest tested m value is 5, which we find to be rather
a small value.¹²

[Plot: Reinf/Step (vertical axis, about −0.08 to 0.04) versus episode number (horizontal axis, 50 to 250).]

Figure 5: The car parking problem, learning curves for study 2.

The learning curves for this study are presented in Figure 5. The results for m = 25
were taken from study 1 for comparison. The observations can be summarized as follows:

• m = 5 is the worst and m = 25 is the best,

• the differences between intermediate m values do not seem to be statistically
significant,

• even the smallest m = 5 gives a performance level much better than that obtained in
study 1 for small λ, i.e., even relatively small m values allow us to have the advantages
of large λ, though larger m values are generally better than small ones.

The last observation is probably the most important. It is also very optimistic. It suggests
that, at least in some problems, the TTD procedure with λ > 0 allows one to obtain a
significant learning speed improvement over traditional TD(0)-based algorithms with practically
no additional costs, because for small m both the space and time complexity induced by TTD
are always negligible.

10. As a matter of fact, non-policy actions were not replayed at all in Lin's experience replay experiments.

11. Except for using λ = 0 for the most recent time step covered by the TTD return, as it follows from its
definition (Equation 24).

12. For γ = 0.95, λ = 0.9, and m = 5 we have $(\gamma\lambda)^m \approx 0.457$, which is by all means comparable with
γλ = 0.855.

5.2 The Cart-Pole Balancing Problem 

The experiments of this section have one basic purpose: to verify the effectiveness of the 
TTD procedure by applying its AHC implementation to a realistic and complex problem, 
with a long reinforcement delay, for which there exist many previous results for comparison. 
The cart-pole balancing problem, a classical benchmark of control specialists, is just such 
a problem. In particular, we would like to see whether it is possible to obtain performance 
(learning speed and the quality of the final policy) not worse than that reported by Barto 
et al. (1983) and Sutton (1984) using the eligibility traces implementation. 

Figure 6 shows the cart-pole system. The cart is allowed to move along a one-dimensional 
bounded track. The pole can move only in the vertical plane of the cart and the track. The 
controller applies either a left or right force of fixed magnitude to the cart at each time 
step. The task is episodic: each episode finishes when a failure occurs, i.e., the pole falls or 
the cart hits an edge of the track. The objective is to delay the failure as long as possible. 

The problem was realistically simulated by numerically solving a system of differential 
equations, describing the cart-pole system. These equations and other simulation details 
are given in Appendix B. All parameters of the simulated cart-pole system are exactly the 
same as used by Barto et al. (1983). 

5.2.1 State Representation 

The state of the cart-pole system is described by four state variables: 

• $x$: the position of the cart on the track,

• $\dot{x}$: the velocity of the cart,

• $\theta$: the angle of the pole with the vertical,

• $\dot{\theta}$: the angular velocity of the pole.

5.2.2 Action Representation 

At each step the agent controlling the cart-pole system chooses one of the two possible
actions of applying a left or right force to the cart. The force magnitude is fixed and
equal to 10 N.

Figure 6: The cart-pole system. F is the force applied to the cart's center, l is a half of the
pole length, and d is a half of the length of the track.



5.2.3 Reinforcement Mechanism 

The agent receives non-zero reinforcement values (namely −1) only at the end of each
episode, i.e., after a failure. A failure occurs whenever |θ| > 0.21 rad (the pole begins to
fall) or |x| > 2.4 m (the cart hits an edge of the track). Even at the beginning of learning,
with a very poor policy, an episode may continue for hundreds of time steps, and there may
be many steps between a bad action and the resulting failure. This makes the temporal
credit assignment problem in the cart-pole task extremely hard.

5.2.4 Function Representation 

As in the case of the car parking problem, we deal with the continuous state space of the
cart-pole system by dividing it into disjoint regions, called boxes after Michie and Chambers
(1968). The quantization thresholds are the same as used by Barto et al. (1983), i.e.:

• for $x$: −0.8, 0.8 m,

• for $\dot{x}$: −0.5, 0.5 m/s,

• for $\theta$: −0.105, −0.0175, 0, 0.0175, 0.105 rad,

• for $\dot{\theta}$: −0.8727, 0.8727 rad/s,

which yields 3 × 3 × 6 × 3 = 162 boxes. For each box there is a memory location, storing a
function value for that box.
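The mapping from a continuous cart-pole state to one of the 162 boxes can be sketched in Python with the thresholds listed above. The function name `box_index`, the ordering of the variables in the index, and the treatment of values exactly equal to a threshold are assumptions of this example; the failure bounds on |x| and |θ| are handled separately by the task itself.

```python
from bisect import bisect_right

# Quantization thresholds in the order (x, x_dot, theta, theta_dot).
THRESHOLDS = (
    (-0.8, 0.8),                             # cart position, m
    (-0.5, 0.5),                             # cart velocity, m/s
    (-0.105, -0.0175, 0.0, 0.0175, 0.105),   # pole angle, rad
    (-0.8727, 0.8727),                       # pole angular velocity, rad/s
)

NUM_BOXES = 1
for cuts in THRESHOLDS:
    NUM_BOXES *= len(cuts) + 1  # n thresholds split an axis into n + 1 intervals


def box_index(state):
    """Map a 4-tuple state to an integer in range(NUM_BOXES) (mixed-radix code)."""
    index = 0
    for value, cuts in zip(state, THRESHOLDS):
        index = index * (len(cuts) + 1) + bisect_right(cuts, value)
    return index


if __name__ == "__main__":
    print(NUM_BOXES)
    print(box_index((0.0, 0.0, 0.01, 0.0)))
```

A look-up table for the evaluation function is then simply a list of `NUM_BOXES` numbers indexed by `box_index(state)`.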
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5.2.5 Experimental Design and Results 

Computational expense prevented experimental studies as extensive as those for the car
parking problem. Only one experiment was carried out, intended to be a replication of the
experiment presented by Barto et al. (1983). The values of the TTD parameters that seemed
the best in the previous experiments were used, that is, λ = 0.9 and m = 25. The discount
factor γ was set to 0.95. The learning rates for the evaluation and policy functions were
roughly optimized by a small number of preliminary runs and set to α = 0.1 and β = 0.05,
respectively. The temperature of the Boltzmann distribution action selection mechanism
was set to 0.0001, so as to give nearly deterministic action selection. The initial values of
the evaluation and policy functions were set to 0. We did not attempt to strictly replicate
the learning parameter values used by Barto et al. (1983), since they used not
only a different TD(λ) implementation (see footnote 13), but also a different policy
representation (based on the fact that there are only two actions, while our representation
is general), action selection mechanism (for the same reasons), and function learning rule.
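
To illustrate why a temperature of 0.0001 makes the selection nearly deterministic, the
following Python sketch computes Boltzmann probabilities from a list of policy function
values. The function names and the use of a list of per-action values are assumptions made
for the illustration; the maximum is subtracted before exponentiation only to avoid
numerical overflow and does not change the resulting distribution.

```python
import math
import random


def boltzmann_probabilities(preferences, temperature):
    """Probabilities proportional to exp(preference / temperature)."""
    top = max(preferences)
    # Exponents are <= 0 after the shift, so exp() cannot overflow.
    weights = [math.exp((p - top) / temperature) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(preferences, temperature, rng=random):
    """Draw an action index according to the Boltzmann distribution."""
    probabilities = boltzmann_probabilities(preferences, temperature)
    return rng.choices(range(len(preferences)), weights=probabilities)[0]
```

With equal values both actions are equally likely, while a difference of only 0.01 between
the two values already makes the better action almost certain at this temperature.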

The experiment consisted of 10 runs, differing only in the initial seed of the random
number generator, and the presented results are averaged over those 10 runs. Each run
continued for 100 episodes. Some individual runs were terminated after 500,000 time steps,
before completing 100 episodes. To produce reliable averages for all 100 episodes, fictitious
remaining episodes were added to such runs, with the duration assigned according to the
following principle, used in the experiments of Barto et al. (1983). If the duration of the
last, interrupted episode was less than the duration of the immediately preceding (complete)
episode, the fictitious episodes were assigned the duration of that preceding episode.
Otherwise, the fictitious episodes were assigned the duration of the last (incomplete) episode.
This prevented any short interrupted episodes from producing unreliably low averages. The
results are presented in Figure 7 as plots of the average duration (the number of time steps)
of the previous 5 consecutive episodes versus the episode number, in linear and logarithmic
scale.
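
The padding rule and the 5-episode averaging can be sketched in Python as below. The
function names are hypothetical, and two details are assumptions rather than statements
from the text: the interrupted episode keeps its own recorded duration, and the average
over the first few episodes uses however many episodes are available.

```python
def pad_with_fictitious_episodes(durations, interrupted, total_episodes=100):
    """Extend a run's episode durations to total_episodes entries.

    durations   -- episode lengths in order; if interrupted is True,
                   the last entry is the incomplete episode
    """
    padded = list(durations)
    if interrupted and len(padded) < total_episodes:
        last = padded[-1]
        previous = padded[-2] if len(padded) >= 2 else last
        # Use the preceding episode's duration if the interrupted one is shorter.
        fill = previous if last < previous else last
        padded.extend([fill] * (total_episodes - len(padded)))
    return padded


def trailing_average(durations, window=5):
    """Average duration of the most recent `window` episodes at each episode."""
    averages = []
    for i in range(len(durations)):
        recent = durations[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages
```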

We can observe that TTD-based AHC achieved a performance level similar to (to be exact,
slightly better than) that reported by Barto et al. (1983), both in learning speed and in the
quality of the final policy (i.e., the balancing periods). The final balancing periods lasted
above 130,000 steps on average. This was obtained without using 162 additional memory
locations for storing eligibility traces, and without the expensive computation necessary to
update all of them at each time step, as well as all evaluation and policy function values.

5.3 Computational Savings 

The experiments presented above illustrate the computational savings possible with the
TTD procedure over conventional eligibility traces. A direct implementation of eligibility
traces requires computation proportional to the number of states, i.e., to 1260 in the car
parking task and to 162 in the cart-pole task, and potentially many more in larger tasks.
Even the straightforward iterative version of TTD may then be beneficial, as it requires
computation proportional to m, which may reasonably be assumed to be many times less
than the size of the state space. Of course, the incremental version of TTD, which always
requires very little computation, independent of m, is much more efficient.

13. It was the eligibility traces implementation, but eligibility traces were updated by applying a somewhat
different update rule than specified by Equation 8. In particular, they were discounted with λ alone
instead of γλ. Moreover, two different λ values were used for the evaluation and policy functions.

Figure 7: The cart-pole balancing problem, learning curve in (a) linear and (b) logarithmic
scale.

In many practical implementations, to improve efficiency, eligibility traces and predictions
are updated only for relatively few recently visited states. Traces are maintained only
for the n most recently visited states, and the eligibility traces of all other states are assumed
to be 0 (see footnote 14). But even for this "efficient" version of eligibility traces, the
savings offered by TTD are considerable. For a good approximation to infinite traces in such
tasks as those considered here, n should be at least as large as m. For conventional eligibility
traces, there will always be a concern for keeping n low, by reducing γ, λ, or the accuracy of
the approximation. The same problem occurs for iterative TTD (see footnote 15), but for
incremental TTD, on the other hand, none of these is at issue. The same small amount of
computation is needed independent of m.
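
For comparison, the "efficient" version of eligibility traces described above might look
roughly like the following Python sketch. The class and method names are hypothetical, and
the accumulating trace rule with decay γλ is an assumption (Equation 8 is not reproduced in
this section). The point of the sketch is only that both loops run over up to n stored
states at every time step.

```python
from collections import OrderedDict


class RecentTraces:
    """Eligibility traces kept only for the n most recently visited states."""

    def __init__(self, n, gamma, lam):
        self.n = n
        self.decay = gamma * lam
        self.traces = OrderedDict()  # state -> trace, least recently visited first

    def visit(self, state):
        # Decay every stored trace: cost proportional to n per time step.
        for s in self.traces:
            self.traces[s] *= self.decay
        self.traces[state] = self.traces.get(state, 0.0) + 1.0
        self.traces.move_to_end(state)
        while len(self.traces) > self.n:
            # The trace of the dropped state is from now on treated as 0.
            self.traces.popitem(last=False)

    def apply(self, values, alpha, td_error):
        # Update the prediction of every state with a stored trace: again up to n updates.
        for s, e in self.traces.items():
            values[s] = values.get(s, 0.0) + alpha * td_error * e
```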

6. Conclusion 

We have informally derived the TTD procedure from the analysis of the updates introduced
by TD methods to the predicted utilities of states, and shown that they can be approximated
by the use of truncated TD(λ) returns. Truncating temporal differences allows easy
and efficient implementation. It is possible to compute TTD returns incrementally in constant
time, irrespective of the value of m (the truncation period), so that the computational
expense of using TD-based reinforcement learning algorithms with λ > 0 is negligible (cf.
Equations 25 and 26). This cannot be achieved with the eligibility traces implementation.
The latter, even for those function representation methods to which it is particularly well
suited (e.g., neural networks), is always associated with significant memory and time costs.
The TTD procedure is probably the most computationally efficient (although approximate)
on-line implementation of TD(λ). It is also general, equally good for any function
representation method that might be used.

14. This modification cannot be applied when a parameter estimation function representation technique is
used (e.g., a multi-layer perceptron), where traces are maintained for weights rather than for states.

15. The relative computational expense of iterative TTD and the "efficient" version of eligibility traces
depends on the cost of the function update operation, which is always performed only for one state by
the former, and for n states by the latter.

An important question concerning the TTD procedure is whether its computational
efficiency is obtained at the cost of reduced learning efficiency. Low computational
costs per control action may not be attractive if the number of actions necessary to
converge becomes large. As yet, no theoretically grounded answer to this important
question has been provided, though it is not unlikely that such an answer will eventually
be found. Nevertheless, some informal considerations suggest that the TTD-based
implementation of TD methods not only need not perform worse than the classical
eligibility traces implementation, but can even have some advantages. As follows from
Equations 20, 21, and 22, using TD(0) errors for on-line TD(λ) learning, as in the eligibility
traces implementation, introduces an additional discrepancy term, whose influence on the
learning process is proportional to the square of the learning rate. That term, though often
negligible, may still be harmful in certain cases, especially in tasks where the agent is likely
to stay in the same states for long periods. The TTD procedure, based on truncated TD(λ)
returns, is free of this drawback.

Another argument supporting the TTD procedure is associated with using large λ values,
in particular 1. For an exact TD(λ) implementation, such as that provided by eligibility
traces, this means that learning relies solely on actually observed outcomes, without any
regard to currently available predictions. This may be beneficial at the early stages of
learning, when predictions are almost completely inaccurate, but in general it is rather
risky: actual outcomes may be noisy and therefore sometimes misleading. The TTD procedure
never relies on them entirely, even for λ = 1, since it uses m-step TTD returns for some
finite m, corrected by always using λ = 0 for discounting the predicted utility of the most
recent step covered by the return (cf. Equation 17). This deviation of the TTD procedure
from TD(λ) may turn out to be advantageous.
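
A minimal Python sketch of how an m-step truncated TD(λ) return might be computed
iteratively is given here. Equation 17 is not reproduced in this section, so the recursion
is an assumption reconstructed from the description above: each step mixes the following
return and the predicted utility of the next state with weights λ and 1 - λ, and the most
recent step relies on the prediction alone (λ = 0). The function and argument names are
hypothetical.

```python
def ttd_return(rewards, next_utilities, gamma, lam):
    """m-step truncated TD(lambda) return for the first of m consecutive steps.

    rewards[k]        -- reinforcement received at step t + k, for k = 0..m-1
    next_utilities[k] -- currently predicted utility of the state at step t + k + 1
    """
    assert len(rewards) == len(next_utilities) and len(rewards) > 0
    # Most recent step covered by the return: bootstraps fully from the prediction.
    z = rewards[-1] + gamma * next_utilities[-1]
    # Walk backwards through the remaining m - 1 steps (cost proportional to m).
    for r, u in zip(reversed(rewards[:-1]), reversed(next_utilities[:-1])):
        z = r + gamma * (lam * z + (1.0 - lam) * u)
    return z
```

Under this reading, λ = 0 reduces to the ordinary one-step return, while λ = 1 gives the
discounted sum of the m observed reinforcements plus the discounted prediction for the
state reached after m steps, so the prediction is never ignored entirely.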

The TTD procedure using TTD returns for learning is only suitable for the implementation
of TD methods applied to reinforcement learning. This is because in RL a part of the
predicted outcome is available at each step, as the current reinforcement value. However,
it is straightforward to formulate another version of the TTD procedure, using truncated
TD(λ) errors instead of truncated TD(λ) returns, that would cover the whole scope of
applications of generic TD methods.

The experimental results obtained for the TTD procedure seem very promising. The results
presented in Section 5.1 show that using large λ with the TTD procedure can give a
significant performance improvement over simple TD(0) learning, even for relatively small m.
While this says nothing about the relative performance of TTD and the eligibility
traces implementation of TD(λ), it at least suggests that the TTD procedure can be useful.
The best results have been obtained for the largest λ values, including 1. This observation,
which contradicts the results reported by Sutton (1988), may be a positive consequence of
the TTD procedure's deviation from TD(λ) discussed above.

The experiments with the cart-pole balancing problem supplied empirical evidence that
for a learning control problem with a very long reinforcement delay the TTD procedure can
equal or outperform the eligibility traces implementation of TD(λ), even for a value of m
many times less than the average duration of an episode. This performance level is obtained
with the TTD procedure at a much lower computational (both memory and time) expense.

To summarize, our informal consideration and empirical results suggest that the TTD 
procedure may have the following advantages: 

• the possibility of implementing reinforcement learning algorithms that may
be viewed as instantiations of TD(λ), using λ > 0 for faster learning,

• computational efficiency: low memory requirements (for reasonable m) and little
computation per time step,

• generality, compatibility with various function representation methods,

• good approximation of TD(λ) for λ < 1 (or for λ = 1 and γ < 1),

• good practical performance, even for relatively small m.

There seems to be one important drawback: lack of theoretical analysis and a convergence
proof. We know neither what parameter values assure convergence nor what
values make it impossible. In particular, no estimate is available of the potential harmful
effects of using too large m. Both the advantages and drawbacks make the TTD procedure
an interesting and promising subject for further work. This work should concentrate,
on one hand, on examining the theoretical properties of this technique, and, on the other
hand, on empirical studies investigating the performance of various TD-based reinforcement
learning algorithms implemented within the TTD framework on a variety of problems, in
particular in stochastic domains.



Appendix A. Car Parking Problem Details 

The motion of the car in the experiments of Section 5.1 is simulated by applying at each 
time step the following equations: 

1. if r ≠ 0 then

(a) θ(t + τ) = θ(t) + τv/r;

(b) x(t + τ) = x(t) - r sin θ(t) + r sin θ(t + τ);

(c) y(t + τ) = y(t) + r cos θ(t) - r cos θ(t + τ);

2. if r = 0 then

(a) θ(t + τ) = θ(t);

(b) x(t + τ) = x(t) + τv cos θ(t);

(c) y(t + τ) = y(t) + τv sin θ(t);

where r is the turn radius, v is the car's velocity, and τ is the simulation time step. In the
experiments r = -5 m was used for the 'turn left' action, r = 5 m for 'turn right', and r = 0
for 'drive straight on'. The velocity was constant and set to 1 m/s, and the simulation time
step τ = 0.5 s was used. With these parameter settings, the shortest possible path from the
car's initial location (x = 6.15 m, y = 10.47 m, θ = 3.7 rad) to the garage requires 21 steps.

At each step, after determining the current x, y, and θ values, the coordinates of the
car's corners are computed. Then each side of the car is tested for intersection with the
lines delimiting the driving area and the garage, to determine whether a failure
occurred. If no failure occurred, each corner of the car is tested for being inside the
garage, to determine whether a success occurred.
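
One simulation step of the car's motion can be sketched in Python as follows. The function
name and the dictionary of actions are hypothetical. The y update for the turning case uses
the cosine form, which is what integrating the heading along a circular arc of radius r
gives and which agrees with the straight-line case as the arc becomes flat.

```python
import math

# Turn radii for the three actions, as given in the text (0 encodes driving straight).
TURN_RADIUS = {"turn left": -5.0, "turn right": 5.0, "drive straight on": 0.0}


def car_step(x, y, theta, r, v=1.0, tau=0.5):
    """Advance the car by one time step tau; returns the new (x, y, theta)."""
    if r != 0:
        new_theta = theta + tau * v / r
        new_x = x - r * math.sin(theta) + r * math.sin(new_theta)
        new_y = y + r * math.cos(theta) - r * math.cos(new_theta)
    else:
        new_theta = theta
        new_x = x + tau * v * math.cos(theta)
        new_y = y + tau * v * math.sin(theta)
    return new_x, new_y, new_theta
```

For example, `car_step(6.15, 10.47, 3.7, TURN_RADIUS["turn left"])` advances the car from
its initial location by one 'turn left' step.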



Appendix B. Cart-Pole Balancing Problem Details 

The dynamics of the cart-pole system are described by the following equations of motion: 

$$\ddot{x}(t) = \frac{F(t) + m_p l \left[\dot{\theta}^2(t) \sin\theta(t) - \ddot{\theta}(t) \cos\theta(t)\right] - \mu_c \,\mathrm{sgn}\,\dot{x}(t)}{m_c + m_p}$$

$$\ddot{\theta}(t) = \frac{g \sin\theta(t) + \cos\theta(t) \left[\dfrac{-F(t) - m_p l \dot{\theta}^2(t) \sin\theta(t) + \mu_c \,\mathrm{sgn}\,\dot{x}(t)}{m_c + m_p}\right] - \dfrac{\mu_p \dot{\theta}(t)}{m_p l}}{l \left[\dfrac{4}{3} - \dfrac{m_p \cos^2\theta(t)}{m_c + m_p}\right]}$$

where

g = 9.8 m/s^2 - acceleration due to gravity,
m_c = 1.0 kg - mass of the cart,
m_p = 0.1 kg - mass of the pole,
l = 0.5 m - half of the pole length,
μ_c = 0.0005 - friction coefficient of the cart on the track,
μ_p = 0.000002 - friction coefficient of the pole on the cart,
F(t) = ±10.0 N - force applied to the center of the cart at time t.

The equations were simulated using Euler's method with simulation time step τ = 0.02 s.
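
A simulation step of this kind can be sketched in Python as below. The sketch assumes the
standard cart-pole equations of motion of Barto et al. (1983) with the parameter values
listed above, a plain forward Euler update (positions advanced with the old velocities), and
the failure thresholds from Section 5.2.3. The function and constant names are hypothetical.

```python
import math

G = 9.8          # acceleration due to gravity, m/s^2
M_C = 1.0        # mass of the cart, kg
M_P = 0.1        # mass of the pole, kg
L = 0.5          # half of the pole length, m
MU_C = 0.0005    # friction coefficient of the cart on the track
MU_P = 0.000002  # friction coefficient of the pole on the cart
TAU = 0.02       # simulation time step, s


def sgn(value):
    """Sign of value as -1, 0, or 1."""
    return (value > 0) - (value < 0)


def cart_pole_step(state, force):
    """One forward Euler step; state is (x, x_dot, theta, theta_dot), force is +10 or -10 N."""
    x, x_dot, theta, theta_dot = state
    total_mass = M_C + M_P
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    inner = (-force - M_P * L * theta_dot ** 2 * sin_t + MU_C * sgn(x_dot)) / total_mass
    theta_acc = (G * sin_t + cos_t * inner - MU_P * theta_dot / (M_P * L)) / (
        L * (4.0 / 3.0 - M_P * cos_t ** 2 / total_mass)
    )
    x_acc = (
        force + M_P * L * (theta_dot ** 2 * sin_t - theta_acc * cos_t) - MU_C * sgn(x_dot)
    ) / total_mass
    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


def failed(state):
    """Failure test from Section 5.2.3: pole beyond 0.21 rad or cart beyond 2.4 m."""
    x, _, theta, _ = state
    return abs(theta) > 0.21 or abs(x) > 2.4
```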



Acknowledgements 

I wish to thank the anonymous reviewers of this paper for many insightful comments. I was 
unable to follow all their suggestions, but they contributed much to improving the paper's 
clarity. Thanks also to Rich Sutton, whose assistance during the preparation of the final 
version of this paper was invaluable. 

This research was partially supported by the Polish Committee for Scientific Research 
under Grant 8 S503 019 05. 



References 

Baird, III, L. C. (1993). Advantage updating. Tech. rep. WL-TR-93-1146, Wright Labora- 
tory, Wright-Patterson Air Force Base. 

Barto, A. G. (1992). Reinforcement learning and adaptive critic methods. In White, D. A., 
& Sofge, D. A. (Eds.), Handbook of Intelligent Control, pp. 469-491. Van Nostrand 
Reinhold, New York. 

Barto, A. G., Sutton, R. S., & Anderson, C. (1983). Neuronlike adaptive elements that can 
solve difficult learning control problems. IEEE Transactions on Systems, Man, and 
Cybernetics, 13, 835-846. 

Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990). Learning and sequential 
decision making. In Gabriel, M., & Moore, J. (Eds.), Learning and Computational 
Neuroscience. The MIT Press. 

Cichosz, P. (1994). Reinforcement learning algorithms based on the methods of temporal 
differences. Master's thesis, Institute of Computer Science, Warsaw University of 
Technology. 

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362.

Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Machine Learning,
14, 295-301.

Heger, M. (1994). Consideration of risk in reinforcement learning. In Proceedings of the 
Eleventh International Conference on Machine Learning (ML-94). Morgan Kaufmann. 

Jaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the convergence of stochastic iterative 
dynamic programming algorithms. Tech. rep. 9307, MIT Computational Cognitive 
Science. Submitted to Neural Computation. 

Klopf, A. H. (1982). The Hedonistic Neuron: A Theory of Memory, Learning, and
Intelligence. Washington D.C.: Hemisphere.

Lin, L.-J. (1992). Self-improving, reactive agents based on reinforcement learning, planning 
and teaching. Machine Learning, 8, 293-321. 

Lin, L.-J. (1993). Reinforcement Learning for Robots Using Neural Networks. Ph.D. thesis, 
School of Computer Science, Carnegie-Mellon University. 

Mitchie, D., & Chambers, R. A. (1968). BOXES: An experiment in adaptive control. 
Machine Intelligence, 2, 137-152. 

Moore, A. W., & Atkeson, C. G. (1992). An investigation of memory-based function ap- 
proximators for learning control. Tech. rep., MIT Artificial Intelligence Laboratory. 

Pendrith, M. (1994). On reinforcement learning of control actions in noisy and 
non-markovian domains. Tech. rep. UNSW-CSE-TR-9410, School of Computer Sci- 
ence and Engineering, The University of New South Wales, Australia. 

Peng, J., & Williams, R. J. (1994). Incremental multi-step Q-learning. In Proceedings of the 
Eleventh International Conference on Machine Learning (ML-94). Morgan Kaufmann. 

Ross, S. (1983). Introduction to Stochastic Dynamic Programming. Academic Press, New 
York. 



Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted re- 
wards. In Proceedings of the Tenth International Conference on Machine Learning 
(ML-93). Morgan Kaufmann. 

Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff markovian decision 
processes. In Proceedings of the Twelfth National Conference on Artificial Intelligence 
(AAAI-94). 

Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, 
Department of Computer and Information Science, University of Massachusetts. 

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine 
Learning, 3, 9-44. 

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based 
on approximating dynamic programming. In Proceedings of the Seventh International 
Conference on Machine Learning (ML-90). Morgan Kaufmann. 

Sutton, R. S., Barto, A. G., & Williams, R. J. (1991). Reinforcement learning is direct 
adaptive optimal control. In Proceedings of the American Control Conference, pp. 
2143-2146. Boston, MA. 

Sutton, R. S., & Singh, S. P. (1994). On step-size and bias in temporal-difference learning. 
In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 
91-96. Center for Systems Science, Yale University. 

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 
257-277. 

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King's College, 
Cambridge. 

Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 
8, 279-292. 


